Foreword


Data mining has developed rapidly and become very popular over the past two decades, but its origins
lie in the early stages of IT, when it was mostly limited to one-dimensional searching in databases.
The statistical basis of what is now also referred to as data mining was often laid centuries ago. In
corporate environments data-driven decisions quickly became the standard, with the preparation of
data for management becoming the focus of the fields of MIS (management information systems) and
DSS (decision support systems) in the 1970s and 1980s. With even more advanced technology and
approaches becoming available, such as data cubes, the field of business intelligence took off quickly
in the 1990s and has since played a core role in corporate data processing and in data management
in public administration.

Especially in public administration, the availability and the correct analysis of data have always been 
of major importance. Ample amounts of data collected for producing statistical analyses and forecasts 
on economic, social, health and education issues show how important data collection and data analysis 
have become for governments and international organisations. The resulting, periodically produced sta- 
tistics on economic growth, the development of interest rates and inflation, household income, education 
standards, crime trends and climate change are a major input factor for governmental planning. The same 
holds true for customer behaviour analysis, production and sales statistics in business. 

From a researcher's point of view this leads to many interesting topics of high practical relevance,
such as how to assure the quality of the collected data, in which context to use the collected data, and 
the protection of privacy of employees, customers and citizens, when at the same time the appetite 
of businesses and public administration for data is growing exponentially. While in previous decades 
storage costs, narrow communications bandwidth and inadequate and expensive computational power 
limited the scope of data analysis, these limitations are starting to disappear, opening new dimensions 
such as the distribution and integration of data collections, in its most current version “in the cloud”. 
Systems enabling almost unlimited ubiquitous access to data and allowing collaboration with hardly 
any technology-imposed time and location restrictions have dramatically changed the way in which we 
look at data, collect it, share it and use it. 

This book covers such central issues as the preparation of organisations for data mining, the role of
data mining in crisis management, the application of new algorithmic approaches, a wide variety of
examples of applications in business and public management, data mining in the context of
location-based services, privacy issues and legal obligations, the link to knowledge management,
forecasting and traditional statistics, and the use of fuzzy systems, to name only the most important
aspects of the contributions. It thereby provides the reader with a very interesting overview of the
field from an application-oriented perspective. That is why this book can be expected to be a valuable
resource for practitioners and educators.
Preface


Attempts to get organizational or corporate data under control began in earnest in the late 1960s
and early 1970s. Slightly later, owing to management studies and the development of information
societies and organizations, the importance of data in administration and management became even more
evident. Since then, data-, information- and knowledge-based structures, processes and actors have been
under scientific study. Data mining originally involved research drawn mainly from statistics,
computer science, information science, engineering, etc. As stated, and particularly owing to research
on knowledge discovery, knowledge management, information management and electronic government,
data mining has become more closely related to both public and private sector organizations and
governments. Many organizations in the public and private sector generate, collect and refine massive
quantities of data and information. Data mining and its applications have thus been implemented, for
example, to enhance the value of existing information, to highlight evidence-based practices in
management and, finally, to deal with increasing complexities and future demands.

Indeed, data mining may be a powerful application with great potential to help both public and
private organizations focus on their most important information needs. Humans and organizations have
been collecting and systematizing data for a very long time. It has become clear that people,
organizations, businesses and governments increasingly act as consumers of data and information. This
is due to advances in organizational computer technology and e-government, to information and
communication technology (ICT), to increasingly demanding work design, to organizational changes and
complexities, and finally to new applications and innovations in both public and private
organizations (e.g. Tidd et al. 2005, Syvajarvi et al. 2005, Bauer et al. 2006, de Korvin et al. 2007, Burke
2008, Chowdhury 2009). All these studies indicate that data has an increasing impact on organizations
and governance in the public and private sectors.

Hence, data mining has become an increasingly important factor in managing information in
increasingly complex environments. The mining of data, information and knowledge from various databases
has been recognized by many researchers from various academic fields (e.g. Watson 2005). Data mining
can be seen as a multidisciplinary research field, drawing on work from areas such as database technology,
statistics, pattern recognition, information retrieval, learning and networks, knowledge-based systems,
knowledge organizations, management, high-performance computing, data visualization, etc. In the
organizational and government context, too, data mining can be understood as the use of sophisticated
data analysis applications to discover previously unknown, valid patterns and relationships in large data
sets. These objectives and approaches are apparent in various fields of both the public and private
sectors, as the chapters of this book will show.
DATA MINING LINKED TO ORGANIZATIONAL AND GOVERNMENT
CONDITIONS 


Data mining, seen as the extraction of unknown information, typically from large databases, can be
a powerful approach for helping organizations focus on the most essential information. Data mining may
make it easier to predict future trends and behaviors, allowing organizations to make information- and
evidence-based decisions. Organizations live with their history and present activities, but the
prospective analyses offered by data mining may also move beyond the analysis of past or present events.
Such analyses are typically provided by the tools of decision support systems (e.g. McNurlin & Sprague
2006) or by possibilities offered by information management or electronic government (e.g. Heeks 2006,
de Korvin et al. 2007, Syvajarvi & Stenvall 2009). Data mining functionalities also connect to
organizational and government surroundings through traditional techniques or in terms of classification,
clustering, regression and association (e.g. Han & Kamber 2006). Thus, again, data needs to be
classified, arranged and related according to certain situational demands.

It is fundamental to know how data mining can answer organizational information needs that
otherwise might be too complex or unclear. The information that is needed, for example, should usually
be more future-oriented and is quite frequently combined with possibilities offered by
information and communication technology. In many cases, data mining may reveal history,
indicate the present situation or even predict future trends and behaviors, allowing either public
policies or businesses to make proactive and information-driven decisions. Data mining applications may
answer organizational and government questions that have traditionally been too resource-consuming to
resolve or otherwise difficult to learn and handle. These viewpoints are important for sector and
organization performance and productivity, and for facilitating learning and change management
capabilities (Bouckaert & Halligan 2006, Burke 2008, Kesti & Syvajarvi 2010).

Data mining in both the public and private sector is largely about collecting and utilizing data,
analyzing and forecasting on the basis of data, taking care of data qualities, and understanding the
implications of data and information. Thus, from an organizational and government perspective, data
mining is related to mining itself, to applications, to data qualities (i.e. security, integrity, privacy,
etc.), and to information management in order to be able to govern in the public and private sectors. It
is clear that organizations and people collect and process massive quantities of data, but how they do so
and how they proceed with information is not that simple. In addition to the qualities of data, data
mining is thus closely related to management, to organizational and government processes and structures,
and thereby to better information management, performance and overall policy (e.g. Rochet 2004,
Bouckaert & Halligan 2006, Hamlin 2007, Heinrich 2007, Krone et al. 2009). For example, Hamlin (2007)
concluded that in order to satisfy performance measurement requirements policy makers frequently have
little choice but to consider and use a mix of different types of information. Krone et al. (2009) showed
how organizational structures give rise to many challenges and possibilities for knowledge and
information processes.

Data mining may confront organizational and governmental weaknesses or even threats. For example,
in the private sector, competition, technological infrastructures, change dynamics and customer-centric
approaches might be such that there is not always space for proper data mining. In the public sector,
data or information related to service delivery typically originates from various sources. Public
policy processes are also complex in nature and include, for example, a multiplicity of actors,
diversified interdependent actors, longer time spans and political power (e.g. Hill & Hupe 2003, Lamothe &
Dufour 2007). Thus some of these organizational and government guidelines call strongly for better
quality data, more experimental evaluations and advanced applications. Finally, because of the absence
of high-quality data and easily available information, along with high-stakes pressures to demonstrate
organizational improvements, data for these purposes is still more likely to be misused or
manipulated. However, it is evident that organizational and government activities face requirements such
as predicting and forecasting, while topics such as data security, privacy and retention are also vital.

In relation to situational organization and government structures, processes and people, data
mining is especially connected to qualities, management, applications and approaches that are linked to
data itself. In existing and future organizational and government surroundings, electronic-based views
and information and communication technologies also have a significant place. In the current approach to
data mining in the public and private sectors, we may thus summarize three main thematic dimensions:
data and knowledge, information management, and situational elements. By data and knowledge we
mean the epistemological character of data and the demands that are linked to issues such as security,
privacy, nature, hierarchy and quality. Versatile information management refers here to the
administration of data, data warehouses, data-based processes, data actors and people, and applied
information and communication technologies. Situational elements indicate operational and strategic
environments (such as networks, bureaucracies and competition), but also stable or change-based
situations and various timeframes (e.g. past-present-future). All these dimensions are revealed in the
present chapters.


THE BOOK STRUCTURE AND FINAL REMARKS 


This book includes research on data mining in the public and private sectors. Furthermore, both
organizational and government applications are under scientific study. In total, eighteen chapters have
been divided into four consecutive sections. Section 1 handles data mining in relation to management and
government, while Section 2 is about data mining that concentrates on privacy, security and retention of
data and knowledge. Section 3 relates data mining to organizational and government situations that
require strategic views, future preparations and forecasts. The last section, Section 4, handles various
data mining applications and approaches that are related to organizational scenes.

Hence, we can presuppose that managerial decision-making situations are followed by both
rational and tentative procedures. As data mining is typically associated with data warehouses (i.e.
various volumes of data and various sources of data), we are able to clarify some key dimensions of
data-mined decisions (e.g. Beynon-Davies 2002). These include information needs, seeking and usage in
data and information management. As data mining is seen as the extraction of information from large
databases, we still notice the management linkage in terms of traditional decision-making phases (i.e.
intelligence, design, choice and review) and managerial roles such as informational roles (Mintzberg
1973, Simon 1977). In relation to management, it is obvious that organizations need tools, systems and
procedures that might be useful in decision making. Management of information resources means that data
has meaning; furthermore, as information resources and the demands placed on them have expanded, the
job of managing has also expanded (e.g. McNurlin & Sprague 2006).

In organizational and government surroundings, it is valuable to notice that data mining is commonly
associated with knowledge and knowledge discovery. Knowledge discovery is about combining information
to find hidden knowledge (e.g. Papa et al. 2008). However, it again seems important to understand
how “automated” or convenient the extraction of information is, whether that information represents
stored knowledge or is to be discovered from large and varied clusters or data warehouses. For example,
Moon (2002) has argued that information technology has made it possible to handle information among
governmental agencies and to enhance internal managerial efficiency and the quality of public service
delivery, but that simultaneously there are many barriers and legal issues that cause delays.
Consequently, one core factor here is the security of data and information. Information security in the
organizational and government context typically means protecting information and information systems
from unauthorized access, use, disclosure, modification and destruction (e.g. Karyda, Mitrou &
Quirchmayr 2006, Brotby 2009). Organizations and governments accumulate a great deal of information,
and thus information security needs to be studied in terms of management, legal informatics, privacy,
etc. Finally, the latter aspects carry considerable weight, as information security policy documents can
describe organizational and government intentions regarding information.

Data mining is affected by current and future situations that are changing and developing rather
constantly in both the public and private sectors. Situational awareness of past, present and future
circumstances denotes an understanding of those aspects that are relevant to organizational and
government life. In this context data mining is connected to both learning and forecasting capabilities,
but also to organizational structures, processes and people, which indeed may fluctuate. Preparing and
forecasting for various organizational and government situations, as well as structural choices
such as bureaucratic, functional, divisional, network, boundary-less and virtual structures, are all
closely tied to data mining approaches. Especially in the era of digital government, organizations simply
need to seek, receive, transmit and finally learn with information in various ways. With regard to topics
such as organizational structures, government viewpoints and the field of e-Government, it is probably
fast development, continuous change and familiarity with technology that explain why situational factors
are progressively more stressed (e.g. Fountain 2001, Moon 2002, Syvajarvi et al. 2005, Bauer et al. 2006,
Brown 2007). In the case of data mining, it is important to recognize that these changes pose a number of
challenges to citizens, businesses and public governments. As a consequence, the change effort for any
organization is quite unique to that organization (cf. Burke 2008). For instance, Heeks (2006) assumes
that we need to see changing and developing governments as management information systems.
Barrett et al. (2006) studied organizational change and concluded that what is needed are studies that
draw on and combine both organizational studies and information system studies.

As final remarks, we conclude that organizational and government situations are becoming
increasingly complex while data is becoming more important. Core demands such as service needs and
conditions, the ubiquitous society, organizational structures, the renewal of work processes, the quality
of data and information, and finally continuous and discontinuous change challenge both the public and
private sectors. Data volumes are still growing almost exponentially and changing very fast, and this is
not likely to stop. This book aims to provide some relevant frameworks and research in the area of
organizational and government data mining. It will increase understanding of how data mining is used and
applied in the public and private sectors. The mining of data, information and knowledge from various
locations has been recognized here by researchers from multidisciplinary academic fields. This book shows
that data mining, as well as its links to information and knowledge, has become a very valuable resource
for societies, organizations, actors, businesses and governments of all kinds.

Indeed, both organizations and government agencies need to generate, collect and utilize data
in public and private sector activities. Both organizational and government complexities are growing,
and simultaneously the potential of data mining is becoming more and more evident. However, the
implications of data mining in organizations and government agencies still remain somewhat blurred or
unrevealed. This uncertainty is now at least partly reduced. Finally, this book is intended for
researchers and professionals who work in the field of data, information and knowledge. It involves
advanced knowledge of data mining from various disciplines such as public administration, management,
information science, organization science, education, sociology, computer science and applied information
technology. We hope that this book will stimulate further data-mining-based research focused on
organizations and governments.


Antti Syväjärvi 
Jari Stenvall 
Editors 
ABSTRACT


Although policy evaluation has always been important, there is today growing attention to policy
evaluation in the public sector. In order to provide a solid base for so-called evidence-based policy,
valid and reliable data are needed to depict the performance of organisations within the public sector.
Without a solid empirical base, one needs to be very careful with data mining in the public sector. When 
measuring performance, several unintended and negative effects can occur. In this chapter, the authors 
focus on a few common pitfalls that occur when measuring performance in the public sector. They also 
discuss possible strategies to prevent them by setting up and adjusting the right measurement systems 
for performance in the public sector. Data mining is about knowledge discovery. The question is: what 
do we want to know? What are the consequences of asking that question? 


INTRODUCTION 


Policy aims at desired and foreseen effects. That is
the very nature of policy. Policy needs to be
evaluated so that policy makers know whether the
specific policy measures indeed reach their intended
results and objectives and, if so, how, how
efficiently or effectively, and with what unintended
or unforeseen effects. However, measuring policy
effects is not without disadvantages. The policy
evaluation process can cause side effects.

Evaluating policy implies making fundamental
choices. It is not an easy exercise. Moreover,
policy actors are aware of the methods with which
their activities (their policy or its implementation)
will or could be evaluated. They can anticipate
the evaluation, e.g. by changing the official policy
goals, a crucial standard in the evaluation process,
or by choosing only those goals that can be met
and avoiding more ambitious goals that are more
difficult to reach. In this context, policy actors
behave strategically (Swanborn, 1999). In this
chapter, we focus on these and other side effects
of policy evaluation. However, we also want to
place them in a broader framework.

Within the public sector, as elsewhere, there
is a need for tools to dig through
huge collections of data looking for previously
unrecognized trends or patterns. Within the public
sector, one often refers to “official data” (Brito &
Malerba, 2003, 497). There too, knowledge and
information are cornerstones of a (post-)modern
society (Vandijck & Despontin, 1998). In this
context, data mining is essential for the public
sector. Data mining can be seen as part of the
wider process of so-called Knowledge Discovery
in Databases (KDD). KDD is the process of
distilling information from raw data, while data
mining is more specific and refers to the discovery
of patterns in terms of classification, problem
solving and knowledge engineering (Vandijck &
Despontin, 1998).

However, before the actual data mining can
start, we need a solid empirical base. Only
then does the public sector have a valid and reliable
governance tool (Bouckaert & Halligan, 2008).
In general, the public sector is quite well
documented. In recent decades, huge amounts of data
and reports have been published on the output
and management of the public sector in general.
However, a stubborn problem is the gathering of
data about the specific functioning of specific
institutions within the broad public sector.

The use of data and data mining in the public
sector is crucial for evaluating public programs
and investments, for instance in crime,
traffic, economic growth, social security, public
health, law enforcement, integration programs for
immigrants, cultural participation, etc. Thanks to
the implementation of ICT, recording and storing
transactional and substantive information is much
easier. The possible applications of data mining in
the public sector are quite diverse: it can be used in
policy implementation and evaluation, targeting of
specific groups, customer-centric public services,
etc. (Gramatikov, 2003).

A major topic in data mining in the public sector
is the handling of personal information. The use
of such information is a balance between respect for
privacy, data integrity and data security on the
one hand and maximising the available information
for general policy purposes on the other (cf.
Crossman, 2008). Intelligent data mining can
reduce societal uncertainty
without endangering the privacy of citizens.

During the past decades, the functioning of and
the ideas about the public sector have changed
profoundly. Several developments explain these changes.
Cornforth (2003, as cited in Spanhove & Verhoest,
2007) states that two related reforms are crucial.
First, governments create an increasing number
of (quasi-)autonomous government agencies in
order to deliver public services. Secondly, there
is the introduction of market mechanisms into the
provision of public services. As a result, there is
also growing attention to criteria such as
competition, efficiency and effectiveness (Verhoest
& Spanhove, 2007). Spurred by “Reinventing
Government” by Osborne & Gaebler (1993),
performance measurement came to the forefront
in the public sector too. The idea is
tempting and simple: a government organisation
defines its “products” (e.g. services) and develops
indicators to make their production measurable.
This enables an organisation, thanks to the
planning and control cycle, to work towards a
well-performing organisation (De Bruijn, 2002). In
this way, a government can function optimally.

The evaluation of performance within the
public sector received a boost with the hegemony of
the New Public Management (NPM) paradigm. An
essential component of NPM is “explicit standards
and measures of performance” (Hood, 1996, 271).
Direct market incentives, through which poor or
overly expensive performance is sanctioned by
decreasing sales or income so that corrective
action becomes inevitable, are absent in government
performance. The performance of the public sector
therefore needs elaborate and constant evaluation,
so that poor or overly expensive performance
can be corrected. It is often recommended that the
public sector use, as much as possible,
the methods of the private sector, although the
specific characteristics of the public sector must
be taken into account. However, their application
within the public sector does not always go smoothly
(Modell, 2004).

There are many reasons, apart from NPM, to argue
for better evaluation of the performance of the
public sector. One of those arguments is that
better government policy will also reinforce
trust in public service. Although the empirical
material is scarce, there are important indications
that the objective of increasing public trust in
policy making and government is not reached,
and that sometimes even the contrary occurs, if the
publication of performance measurements is not
handled carefully (Hayes & Pidd, 2005).

Other reasons for more performance
measurement speak for themselves. Scarce tax money
must be spent as usefully as possible; citizens are
entitled to the best service. Attention to the
efficiency and effectiveness of the public service has
been at the top of the political and media agenda. For
this reason, citizens and their political
representatives ask for a maximal “return on investment”.
Therefore, there is political pressure to pay more
attention to measuring government policy. The
citizen/consumer is entitled to high-quality public
service.

Measuring government performance, a booming
business, is not a straightforward task. What is, for
example, effectiveness? Roughly and simply
stated, effectiveness is the degree to which the
policy output realizes the objectives, that is, the
desired effects (outcome), independent of the way
in which this effect is reached. That means that many
concepts must be filled in and interpreted. As
a result, effectiveness could become a kind of
super value, which includes several other values
and indicators (Jorgenson, 2006). The striving
towards “good governance” also encompasses a
lot of interpretations, which refer to normative
questions (Verlet, 2008). These interpretations, and
others of, for example, efficiency, transparency and
equity, are determined by the dominant
political climate and economic insights, and by the
broader cultural setting. “Good governance” is
a social construction (Edwards & Clough, 2005)
without a strong basis in empirical research.
Indicators for governance seem, according to
Van Roosbroek, to be mainly policy tools rather than
academic exercises (Van Roosbroek, 2007).

There are many studies about government
performance, from which policy makers want
to draw conclusions. For this reason all kinds
of indicators and rankings see the light, which
compare the performance of one public
authority with that of another. Benchmarking is then
the logical consequence. How such international and
internal rankings are constructed is often unclear.
Van de Walle and others analysed comparative
studies. Their verdict is clear and merciless: the
indicators used in those rankings generally
measure only a rather limited part of government
functioning, and perceptions of that functioning have
to pass for objective measurements of performance.
The fragmentation of the responsibility for
collecting data is an important reason for the
insufficient quality of the indicators used. As a result,
comparisons are problematic. Hence, they stress
the need for good databases that respect common
procedures and for clear, widely accepted rules
about the use and interpretation of such data.
These rules should enable us to compare policy
performance in different countries and so to learn
from good examples. The general rankings often
contain too many subjective indicators, there are
few guarantees about the quality of the samples,
and all too often there are inappropriate
aggregations (Van de Walle, Sterck, Van Dooren &
Bouckaert, 2004; Van de Walle, 2006; Luts, Van
Dooren & Bouckaert, 2008).
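
The concern about aggregation can be illustrated
with a minimal sketch. It is not taken from the
studies cited above: the three authorities, the
two indicators, their scores and the weights are
all hypothetical, and a simple weighted sum is
only one of many possible aggregation rules. The
sketch shows that the same indicator scores can
produce opposite rankings depending on how much
weight a perception-based indicator receives.

```python
# Hypothetical indicator scores (0-100) for three fictional public authorities.
# "efficiency" stands in for a measured indicator, "perceived_quality" for a
# survey-based (subjective) indicator. None of these figures are real data.
scores = {
    "Authority A": {"efficiency": 80, "perceived_quality": 50},
    "Authority B": {"efficiency": 60, "perceived_quality": 75},
    "Authority C": {"efficiency": 70, "perceived_quality": 60},
}


def rank(scores, weights):
    """Return authority names ordered by weighted composite score, best first."""
    composite = {
        name: sum(weights[indicator] * value for indicator, value in indicators.items())
        for name, indicators in scores.items()
    }
    return sorted(composite, key=composite.get, reverse=True)


# Two equally defensible-looking weighting schemes (both are assumptions).
measured_heavy = rank(scores, {"efficiency": 0.8, "perceived_quality": 0.2})
perception_heavy = rank(scores, {"efficiency": 0.2, "perceived_quality": 0.8})

print(measured_heavy)    # ranking when the measured indicator dominates
print(perception_heavy)  # ranking when the perception indicator dominates
```

With these made-up numbers the first and last
places swap between the two weightings, which is
why an undocumented aggregation rule makes a
ranking hard to interpret.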

An important finding based on those
meta-analyses is that, when it comes to the public
sector, there is a lack of internationally comparable
data enabling us to judge performance in
terms of, among others, efficiency and
effectiveness, besides other elements of “good” governance.
Although such comparisons can be significant,
they say little about the actual performance of the
public sector in a specific country. Objectives
and contexts often differ considerably between
countries. The comparisons sometimes place too much
stress on specific parameters, such as the number of
civil servants, and they fail to measure the (quality
of the) output/outcome of public authorities
sufficiently. The discussion about the performance of
the public sector is, however, inevitably an
international one, which, among other things, was
reinforced by the Lisbon Agenda. By 2010 the EU must
be one of the most competitive economic areas (Kuhry,
2004). One important instrument to reach this is
a “performance able government”.

This attention to the consequences of
measuring the impact of government policy is not
new. As early as 1956, Ridgway wrote about the
perverse and unwanted effects that measuring
government performance can have. There are some more
recent studies about it. Smith (1995) showed that
there is consensus about the fact that performance
measurement can also have undesirable effects.
Moreover, those undesirable effects also have a
cost, which is frequently overlooked when
establishing measurement systems (Pidd, 2005a). But
the attention to the unforeseen impact of policy
evaluations remains limited. It is expected that
this will change in the coming years, because of
increased attention to evaluation. The evaluation
process itself will more and more be evaluated.

The current contribution consists of three 
parts. In the second paragraph, we discuss the 
general idea of the measurement of performance 
of governments. In the third paragraph we go into 
some challenges concerning the measurement 
of government policy and performance. In the 
fourth and final part, we focus on the main subject:
which negative effects arise when measuring the
performance of the public sector? We also discuss
several strategies to prevent negative effects when
measuring performance in the public sector.

This contribution deals with questions that arise
and must be solved before we begin the data mining.
The central focus is on the question of what kind
of information is needed and accurate enough to
evaluate government performance, and on how we must
treat that information. Before the mining can begin,
we need to be sure that the data can deliver what
we are looking for. Data mining is about
knowledge discovery. The question is: what do
we want to know? What are the consequences of
asking that question? Does asking that question
have an influence on the data that we need in order
to give the answer?


MEASURING PERFORMANCE 
IN THE PUBLIC SECTOR 


The objective is clear: to depict the performance 
of actors within the public sector. But what is 
“performance”? It surely is a multifaceted con- 
cept that includes several elements. That makes 
it cumbersome to summarise performance in 
one single indicator. Also the relation between 
process and outcome is important (Van de Walle 
& Bouckaert, 2007). Van de Walle (2008) states 
we cannot measure performance and effectiveness 
of the government only by balancing outputs and 
outcomes with regard to certain objectives. This is 
because objectives of governments are generally
vague and sometimes contradictory. The government
is a house with many rooms. Given the
fact that most policy objectives are open to several
interpretations, multiple indicators are required. The
relation between the measured reality and the
indicators used is frequently vague. Effects are
difficult to determine. And even if it is possible
to measure them, it is still quite difficult to
identify the role of the government in bringing
about the effects in a context with many actors
and factors (De Smedt et al., 2004). In all this, we
must also distinguish between deployed resources
(input), processes (throughput), products (output)
and effects (outcome). It goes without saying that
we have to bear in mind the specific objective(s)
and the context in which the evaluation takes place.

Figure 1. The production process and public sector

The evaluation of performance can only be 
done well if there is sufficient attention for the 
complexity of the complete policy process. We 
can represent the production process in the public 
sector as in Figure 1 (OECD, 2007, 16).

At the centre of the production process are 
efficiency and effectiveness. What are those con- 
cepts about? Efficiency indicates the relation 
between the deployed resources (input) and the 
delivered products or service (output) (I/O). Pro- 
ductivity is the inverse of efficiency. Efficiency 
indicates the quantity of input necessary per unit 
of output, whereas productivity is a criterion to 
quantify the output that one can realise per unit 
input (O/I). Effectiveness refers to the cause and 
consequence relation between output and outcome. 
Does policy had the aimed effect (within the 
postulated period)? To what extentare there desired 
or undesirable side effects? In short: efficiency is 
about doing things right, while effectivity is about 
doing the right things. 
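
The two ratios defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name ServiceYear, its field names and the figures are hypothetical and are not taken from the chapter or from any OECD data.

```python
from dataclasses import dataclass


@dataclass
class ServiceYear:
    """One hypothetical yearly observation of a public service."""
    input_cost: float    # deployed resources (input), e.g. in euros
    output_units: float  # delivered products or services (output)


def efficiency(obs: ServiceYear) -> float:
    """Input needed per unit of output (I/O), as defined in the text."""
    return obs.input_cost / obs.output_units


def productivity(obs: ServiceYear) -> float:
    """Output realised per unit of input (O/I), the inverse of efficiency."""
    return obs.output_units / obs.input_cost


# Hypothetical figures: 200,000 euros spent to deliver 4,000 units.
example = ServiceYear(input_cost=200_000.0, output_units=4_000.0)
print(efficiency(example))    # 50.0 (euros per unit of output)
print(productivity(example))  # 0.02 (units of output per euro)
```

Neither ratio says anything about effectiveness: that would require a separate measure linking output to outcome.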

Along the input side for policy evaluation, it 
is essential to get a clear picture of the several 
types of resources. Along the output side, the 
problem for the public sector is that its services
are generally not available on the free market,
so it is generally quite difficult to calculate their
(market) value. For this reason, physical product
indicators, which are an (in)direct measure of
production, are frequently used. This opens up a lot of
choices, and a lot of data to work with. In these
tasks, data mining could be of great assistance.
In contrast to output, it is often not easy to
attribute outcomes to actions performed by the
government (Hatry et al., 1994). Several other
factors, outside the control of a government,
can play a role. What is the part of government
actions in bringing about the desired outcome, and
what is the part of other actions and actors? Do
we need to measure output or outcomes? Policy
evaluation research therefore involves a thorough
study of all possible cause/consequence relations.
Information alone is not sufficient. It is assumed
in traditional evaluation research that
efficiency shows itself by balancing input and output
against each other. However, that gives little
information about the causal link between both.
In the words of Pawson and Tilley (1997), a
“realistic evaluation” is not obvious. Besides,
efficiency and effectiveness are only two criteria.
Other criteria are also important when evaluating
the public sector: legal security, legitimacy,
equity, transparency, accountability, etc. (Smith,
1995). Efficiency and effectiveness are specific
aspects of “good governance”. Although today
much emphasis is laid on efficiency
and effectiveness in particular, good governance is
far more than that (Verlet, 2008). The over-emphasis
on efficiency and effectiveness obscures
other values and criteria.

Figure 2. The performance measurement and its possible consequences

When speaking of policy or performance 
measurement systems, we must distinguish two 
dimensions: the conditions and the consequences. 
The conditions are related to the design and the im- 
plementation of the measurement system, whereas 
the consequences are related to the results of the 
functioning of such a system. The consequences 
can be internal and external. Internal consequences
are, for example, changes in attitudes of employees,
an increase in efficiency and changes in the
assignment of resources. External consequences are
situated outside the organisational borders and
refer to e.g. changes in the perception of citizens
and changes in the societal setting. These concepts
are brought together in the overview shown
in Figure 2 (Hiraki, 2007, 5).

Before discussing the central question on the 
undesirable effects of evaluating government 
policy, we first deal with some particular issues 
of the process of public policy evaluation. 


On Which Level Do We Measure the
Performances of the Government? 


We can distinguish between analyses of government
performance at three levels: the macro,
meso or micro level (Callens, 2007). This is related
to the objective of the analysis: do we want to
analyse the production process of the government
as a whole (macro), of a specific government sector
(meso) or of a specific service to end users (micro)?

At first sight, the idea of an overall index is
very interesting. Such an index could allow us,
for example, to compare the position of Flanders
with a number of regions or countries in order
to make a ranking. Callens (2007) reports four
examples of such an overall index, more specifically
the rankings produced by the European Central
Bank, the Institute for Management Development,
the World Economic Forum and the World Bank.

The main problem with constructing such general
performance indicators is one of aggregation. The
complexity of a government cannot be reduced
to a single indicator. Such a general indicator
takes insufficient account of, for example, the
administrative culture, the differences in state
structures, et cetera.
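
The aggregation problem can be illustrated with a minimal sketch. The regions, sub-indicators, scores and weights below are all hypothetical; the only point is that the ranking produced by a composite index can flip when the weights, which are ultimately a subjective choice, are changed.

```python
# Hypothetical scores (0-100) for two regions on two sub-indicators.
scores = {
    "Region A": {"efficiency": 80.0, "quality": 50.0},
    "Region B": {"efficiency": 55.0, "quality": 70.0},
}


def composite(region_scores, weights):
    """Weighted average of the sub-indicator scores of one region."""
    total_weight = sum(weights.values())
    return sum(region_scores[k] * w for k, w in weights.items()) / total_weight


def ranking(weights):
    """Regions ordered from best to worst composite score."""
    return sorted(scores, key=lambda r: composite(scores[r], weights), reverse=True)


# The same data, two different (equally defensible) weighting schemes.
print(ranking({"efficiency": 0.7, "quality": 0.3}))  # ['Region A', 'Region B']
print(ranking({"efficiency": 0.3, "quality": 0.7}))  # ['Region B', 'Region A']
```

With only two sub-indicators the reversal is easy to see; in a real index with dozens of components the same sensitivity is harder to spot.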

In a so-called sectoral study, one compares,
for example, the efficiency and the effectiveness
of a specific sector in a country or region with
those of that sector in other countries or regions.
A classic example is the research by the Netherlands
Institute for Social Research on the
performance of the public sector (Kuhry, 2004).
The aim of this study was to analyse the differences
in productivity, quality and effectiveness of the
services organised by the government between the
Netherlands and other developed countries. They
studied four policy fields: education, health care,
law and order, and public administration. In addition,
the OECD is also very active in the field of sector
studies (for an overview: OECD, 2007, 38).

When using a micro-approach, one compares
specific public services. For example, there are
comparative studies of fire services, hospitals,
schools, prisons, courts, nurseries and services
concerning the registration of births, deaths and
marriages (e.g. Bouckaert, 1992; Bouckaert, 1993;
Van Dongen, 2004).


Do We Measure Perception 
or Reality? 


The problem with indicators is that they frequently
fail to do justice to the complexity of social choices
which underpin the policy. It is crucial to keep actual
performance and the perception of performance 
separated. Unfortunately, we currently lack good 
general measurement systems of actual perfor- 
mance of the public sector which allow for useful 
comparisons between governments (Van de Walle 
& Bouckaert, 2007). Therefore, in many studies 
government policy users — citizens — are asked 
what they think or feel of the public services. 
Perception becomes important. We must be very 
careful with subjective indicators. 

One of the reasons for this is that a negative
attitude of the population towards government can
lead to a negative perception of the performance of
that government. This attitude possibly has more
to do with the general cultural context than with
the government in question (Van de Walle, Sterck,
Van Dooren & Bouckaert, 2004). So we must take
into account that expectations can influence the
perception to an important degree.


Which Value Indicators Can 
be Used for the Measurement 
of Public Service? 


In the market sector, the production volume can 
be inferred easily from the market value of the 
goods or services in question. Time series can be 
constructed, taking into account the price index. 
This way, one can develop value indicators. Services
produced by the public sector are generally
not traded on the free market. Therefore, their
market value is not known. The value of this type
of production cannot be expressed in money. For
this reason, in most cases physical product
indicators are used (Kuhry, 2004).

This is a generic term that covers several
types of indicators, which can be considered
as direct or indirect measures of production. We
can distinguish between:


• Performance indicators. These indicators
are related to the provided end products,
e.g. the number of diplomas awarded by an
educational institution.

• User indicators. These indicators are related
to the consumers of the services, e.g.
the number of students.

• Process indicators. These indicators concern
the performed activities or intermediary
products, e.g. the number of teaching
hours.


The problem remains that it is very difficult
to measure purely collective goods/services. An
alternative is to express the deployed resources
as a proportion of GDP (Kuhry, 2004). Not only the
volume, but especially the quality of
government policy is quite difficult to measure. What is
the quality of defence? There is considerable dispute
about the definition of quality (Eggink & Blank,
2002). Those authors describe a possible trade-off
between quality and efficiency. If quality is not
sufficiently reflected in the production standards,
then low quality norms can present themselves
as very efficient. However, this trade-off is not a
law: efficiency and quality can go together
(Van Thiel & Leeuw, 2003; cf. infra).


Which Quality Guidelines 
Can be Used? 


Which well-defined quality guidelines have to
be used to assess the data to be incorporated in the
evaluation research? Examples of such criteria can
be found in the research done by the Netherlands
Institute for Social Research (cf. supra), Eurostat
or the OECD. In any case, it is crucial to use
high-quality data if we want to build a reliable and
valid measurement instrument. Besides, the quality
of data is a crucial factor when talking about
the (possible) negative effects of performance
measurement systems (cf. infra).


EFFECTS OF PERFORMANCE 
MEASUREMENT 


Introduction and the Good Side 
of Performance Measurement 


In this section we deal with the core of this chapter
and focus on the unintended and ‘perverse’
or negative impact of policy evaluation. More
specifically, we deal with the impact of measuring
performance. When analysing performance
measurement, we see a predominant output
orientation, although from a policy point of view it
might be more interesting to focus on the eventual
impact of the government action (outcome). As
noted, measuring outcome is difficult, especially
the attribution of the role of the different actors.
Hence, attention goes to output, which is
more easily measurable and for which corrective
action is easier (De Bruijn, 2002). Hereafter, as
in most of the literature, we focus on the measurement
of output.

Within the vast literature on performance 
measurement and performance measurement 
systems, the number of contributions on negative 
or perverse impact of performance measurement 
is rather limited. However, the attention for these
effects is not new or unknown; see, for example, the
analysis of Ridgway (1956). We share the impression
of Pidd (2005a) that, just like performance
measurement itself, the existence of such perverse
effects appears to be uncontroversial. They seem to
be inherent to and accepted in performance
measurement, as if they are unavoidable. However,
perverse effects have direct and indirect costs that
are often not taken into account. In some sectors
they are taken into consideration more than in
others. Within a number of specific policy sectors,
such as health care, we find relatively
much attention to the unintentional impact of
measurement systems (Brans et al., 2008). For a
more general analysis of the problem, we can
refer to the work of De Bruijn (2002 and 2006)
and Smith (1995).

Not only sector-specific characteristics are
relevant; the communication of
performance measurement results is also important
to explain the relevance and effects that performance
measurement can have. It is thus essential
to bear in mind both internal (e.g. regarding
employees) and external (e.g. regarding the general
public) communication (Garnet et al., 2008). The
communication itself can generate impact which
is linked to performance measurement.

Although our attention goes to the negative
impact of performance measurement and to strategies
to reduce it, it is obvious that performance
measurement also has positive effects. As noted,
this chapter is not against evaluation or performance
measurement. Evaluation research can
contribute to the steering of the behaviour of policy
agencies (Swanborn, 1999). De Bruijn (2002)
reports four functions of performance measurement.
In the first place, performance measurement
contributes to transparency. It allows organisations
to offer clarity concerning the products or services
which they offer and the resources which they use
to realise them. Secondly, an organisation can learn
on the basis of performance measurement what
is good and what should be improved. Thirdly,
such a measurement allows for a judgment of the
(administrative) functioning of an organisation,
which contributes to better management because
there is more objective and explicit accountability.

In his analysis of the impact of performance
measurement, Smith (1995) notes that the impact shows
itself in the internal management of organisations
within the public sector, even when evaluations
are clearly aimed at external stakeholders (e.g.
citizens). Therefore, we focus on the impact of
performance measurement on the organisation
itself, and not, for example, on modifying attitudes
of citizens towards those organisations.


Negative Effects of 
Performance Measurement 


In this era where measurement, consulting and
evaluation are popular and big business, the negative
impact of performance measurement gets little
attention. In his analysis, Smith (1995) detected
eight unwanted (negative) effects or dysfunctions
of performance measurement. According
to Smith, they are mainly the result of a lack of
congruence between the objectives of the agents
and the objectives of the principals. Similar reasoning
can be found in the work of De Bruijn (2006),
who draws attention to the tensions between the 
professional (those who are active in the primary 
process) and the manager (who eventually wants 
to steer based on the performance measurement). 
A more general analysis of the (negative) impact 
of performance measurement can be found in the 
work of Bouckaert & Auwers (1999) and Van 
Dooren (2006). 

The degree to which specific (positive and
negative) effects manifest themselves depends strongly
on the structure and culture of a specific
organisation. The quality of the indicators
underpinning the performance data is also essential
(Brans et al., 2008). We discuss the effects in
case of measurement. Sometimes, there is no
measurement at all, because of the negative attitude
towards such measurement or because of
the expected negative effects. In this respect, the
lack of such a measurement system can be seen as
a negative effect as such. What are the most noted
negative effects of performance measurement?


A Too Strong Emphasis on 
the Easily Quantifiable 


In performance measurement systems, the emphasis
is often on quantifiable phenomena. As a
consequence, management will also pay particular
attention to quantifiable processes, at the
expense of aspects of government policy that are
not or less easily quantifiable. This is caused by the
difficulty of and disputes about the definition
of quality and/or changes in its interpretation
(Eggink and Blank, 2002).

Smith (1995) wrote in this context of “tunnel
vision”. He gave the example of health
care in the UK. In that case, the strong emphasis
on perinatal mortality rates led to changes in the
nature of maternity services, at the
expense of non-quantifiable objectives. De Bruijn
(2002) also refers to this problem by indicating that
performance measurement potentially undermines
the professional attitude, by focussing on quantity
and measuring the performance of especially
measurable and easily definable aspects. The example
which he quotes is that of museums, where too
strong a focus on easily measurable data, such as
the number of visitors, dominates other indicators
and considerations, such as the artistic value
of a collection.

This problem can be explained by the divergence
between the objectives of an organisation
and the measurement system. It is specific to
the public sector. Characteristic of the public
sector is that a whole range of objectives must
be realised and that many important objectives
are rather difficult to quantify. In addition, the
objectives of organisations within the public sector
frequently reach much further than the direct aim
of providing services. For example, education
must transfer not only more easily measurable
knowledge and skills, but also attitudes, norms
and values, et cetera.

Mostly, it is very difficult to inventory all
activities and objectives. As a consequence, the
importance attached to performance measurement
of objective data can be reduced, while values can
be stressed more. This requires a fitting policy
culture. Policy measurement is not only about
numbers and figures; it also has to do with a
specific normative view on what the public sector
needs to be and do.


Too Little Attention to the Objectives 
of the Organisation as a Whole 


This second problem is what Smith (1995) 
called sub-optimization. Actors responsible for 
a specific part of the broader organisation tend 
to concentrate on their particular objectives, at 
the expense of the objectives of the organisation 
as a whole. Especially for the public sector this
is a severe problem, since many policy entities
are involved in the realisation of objectives. De
Bruijn (2002) refers to this problem when he states
that performance measurement can hamper the
internal exchange of available expertise and
knowledge. For example, the introduction of
performance measurement in schools had a bad
influence on the cooperation and mutual understanding
between the schools in question.

Much depends on the type of activities of an 
organisation within the public sector. In addition 
the central government can avoid this problem, 
to a certain extent, by a good harmonisation 
between the different sections, e.g. by means of 
general service charters that are translated into 
more operational charters (Verlet, 2008). 


Too Much Attention for
Short-Term Objectives


This problem also stems from the possible
differences between the objectives of an organisation
and what can be covered by the performance
measurement system. Smith called this problem
myopia. It concerns a short-sighted view, in the
sense that one pursues short-term objectives
to the disadvantage of legitimate objectives in the
long run. Performance measurement is mostly
only a snapshot in time. Activities can produce
large advantages in the long term, but that is not
always noticeable in the measurement system.
Many performance measurement systems don’t
give us a picture of the performance over a longer
period, nor of future (anticipated) consequences
of current management action.

This effect is reinforced when executive staff
and employees hold functions for a shorter period.
Of course, in this case too the degree to which
this problem arises depends strongly on the types
of public services and on the culture and the structure
of an organisation. A way to handle this specific
problem in performance measurement is to pay
attention to processes over a
longer period, rather than solely measuring output.


A Too Strong Emphasis on 
Criteria for Success 


This impact is what Smith (1995) called measure
fixation. Spurred by performance measurement,
an organisation feels inclined to overemphasise the
criteria on which it will be judged. In this context,
Brans et al. (2008) pointed out that performance
measurement can lead to ritualism. This means
that one tries to score well on the key indicators
in order to satisfy interested parties. In this context,
several authors refer to the concept of Power
(1999) which deals with disengagement, a false
impression of things: the representation doesn’t
correspond with reality.

Too strong an emphasis on criteria for success
originates from the incapacity of many measurement
systems to map complex phenomena. Smith
gave an example of reducing waiting times
in health care, more specifically the objective
that patients should wait no longer than two years
for a surgical intervention. This had the unforeseen
effect that the number of patients that had to wait
for one year increased and that the initial intake
of patients happened at a later moment. Patients
arrived later on the waiting list.

A possible solution for this problem is to
increase the number of criteria used to assess
the functioning of an organisation. However, we
need to take into account that this can blur the
focus and can lead to demoralisation. An alternative
solution which Smith (1995) suggests is the
recognition that most measurements are proxies
for output and that the ultimate arbiters of the
quality of the output are the customers of the
organisation. In order to make this happen, we need
a clear picture of who those customers are and 
what their expectations and needs are. Moreover, 
we must keep in mind that the perception of the 
functioning of an organisation is not necessarily 
a good indicator for its actual functioning (Verlet, 
Reynaert & Devos, 2005). 


Misrepresentation of Performance 


This effect refers to intentional manipulation of
data. As a result, reported behaviour does not
correspond with actual behaviour. It is self-evident
that the incentive to use these reprehensible
practices is largest when there is a strong emphasis
on performance indicators. The possibilities for
misrepresenting performance are often
high in the public sector (Smith, 1995). This is
because the organisations in question frequently
supply the data and indicators needed for the
evaluation of their own performance (or lack of
it). Here too we can refer to the difficulty of mapping
complex phenomena precisely and reliably, for which
data mining could be a solution. Possible problems
can occur when aggregating or disaggregating
data on performance (Van Dooren, 2006). Smith
(1995) made a distinction between two types of
misrepresentation: creative reporting and fraud.
The difference between the two is sometimes difficult
to draw. Misrepresentation
of performance can be reduced by (internal
and external) audit and by introducing the
possibility of sanctions when misrepresentation
comes to light.


Poor Validity and Reliability 


Under this heading we include the effects
which Bouckaert and Auwers (1999, 77) considered
as pathologies referring to false perceptions
of volume and numbers. More specifically,
they discuss convex and concave measurement
instruments, where respectively higher and lower
values are recorded compared to reality. It is clear
that these problems do not originate particularly
from the tension between managers and professionals
within an organisation, but are due to the
measurement as such.


Wrong Interpretations 


The production process of public services is mostly 
quite complex. Moreover, actors themselves have 
to operate in a complex environment. Therefore,
even if it is possible to map performance
perfectly, it is still not obvious how to interpret the
signals in the data. It goes without saying that wrong
interpretations are a real problem when using
performance indicators.

The performances of several organisations are 
frequently compared with each other. However, 
this is not self-evident, because they might have 
very different objectives, resources, institutional 
and cultural contexts, et cetera. Correctly handling
performance data is a skill. By restricting
the number of indicators, one can slightly counteract
the problem of the wrong interpretation
of data, although this can in itself generate other
perverse effects.


Gaming 


This negative effect of performance measurement
concerns intentionally manipulative behaviour to
secure strategic advantage. Whereas misrepresentation
is about the reported behaviour, gaming
is about the manipulation of actual behaviour. In
the work of De Bruijn (2006) we find an example
of how performance measurement can lead to
strategic behaviour. It has to do with performance
measurement of a service within the Australian
army, which must provide housing to soldiers who
have been stationed far from home. The performance
indicator used is the number of soldiers that
agreed to housing after a maximum of three offers.
After introducing this indicator, quite soon the full
100% of the soldiers agreed after a maximum of three
offers. The explanation was simple. The service
first informally offered housing to the soldiers.
Only when the employees of the service were
fairly certain that the soldier would agree to
the offer did they make the formal offer. It is a matter
of strategic behaviour: the performance exists only
on paper, and its societal meaning is limited.
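
The mechanism in this example can be sketched as follows. The case data and the names Case and share_within_limit are hypothetical and chosen purely for illustration; the sketch compares the indicator as recorded (formal offers only) with the same indicator when informal offers are counted as well.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One hypothetical housing case."""
    informal_offers: int  # offers sounded out informally, never recorded
    formal_offers: int    # offers that enter the official record


def share_within_limit(cases, count_offers, limit=3):
    """Share of cases settled within `limit` offers under a counting rule."""
    settled = sum(1 for c in cases if count_offers(c) <= limit)
    return settled / len(cases)


# Hypothetical cases: every formal offer is made only once acceptance is likely.
cases = [Case(0, 1), Case(2, 1), Case(4, 1), Case(5, 1)]

# Indicator as reported: only formal offers are counted.
reported = share_within_limit(cases, lambda c: c.formal_offers)
# Indicator if informal offers were counted too.
actual = share_within_limit(cases, lambda c: c.informal_offers + c.formal_offers)

print(reported)  # 1.0
print(actual)    # 0.5
```

The reported figure is not falsified: each number in the record is true. What changed is the behaviour that generates the record, which is what distinguishes gaming from misrepresentation.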

How to reduce gaming? In the first place
one can, according to Smith (1995), counteract
gaming by taking into account a broad palette of
performance indicators. Other possibilities are
benchmarking or offering executive managers
career perspectives on a shorter term. This can,
however, lead to myopia (cf. supra).


Petrifaction 


Petrifaction or fossilization refers to the discouragement
of innovation because of too rigid
a measurement. Many performance measurement
systems have the inclination to reward constantly
reproducing the existing. The need to select
performance indicators and objectives in advance
can contribute to blindness to new threats
and possibilities. In the same context gaming can
germinate. The danger of petrifaction originates
from the inevitable time lag between setting up
performance measurement and the possibility (or
difficulty) of adjusting the measurement system.
Consequently, a kind of meta control mechanism
is necessary in order to safeguard the adequacy
of performance indicators.

This petrifaction of organisations can be 
thwarted by providing incentives for anticipating 
new challenges and innovative behaviour, even 
if these activities do not contribute directly to the 
current performance indicators. 


Reinforces Internal Bureaucracy 


Performance measurement needs time and re- 
sources. As it happens, a sound performance mea- 
surement demands a precise recording of inputs, 
processes, outputs, outcomes and additionally 
takes into account the ever changing surrounding 
factors (cf. supra). It goes without saying that such a
measurement demands extra resources and people
from the public administration. Gathering, providing,
analysing, constructing and interpreting the needed
data is sometimes quite complex and demanding.
This generates the need for a specific department,
making the administrative process more complex.


Hamper Ambitions/Cherry Picking 


That performance measurement possibly hampers
the ambitions of an organisation stems
from the fact that organisations can force up their
performance, for example in terms of output, by
optimising the input. More specifically, one can choose
to select the input in such a way that it requires
minimal throughput. In this context one can
talk about “cherry picking”. As an example we
can refer to education, where a school can improve
its output (e.g. in terms of the percentage of
students who pass) by using strict selection criteria for
admitting students (De Bruijn, 2006).

This negative side effect can also be called
“cream-skimming”: skimming the target group
by especially addressing subgroups which are
easy to reach. According to Swanborn (1999) this
effect can also manifest itself through self-selection
by the target group. Under this heading we
can also mention the negative impact of polarisation
(Van Dooren, 2006). The situation that this author
outlines is one in which certain forms of service or
files are passed on because they are considered
hopeless with a view to meeting standards. Rather
than investing in these problem cases, one can
opt to pass them on.


Can Punish Good Performance 


Although performance measurement systems
should stimulate good performance, they can also
punish good performance. Describing this negative
effect, De Bruijn (2002) refers to the work of
Bordewijk and Klaassen (2000), who indicated
that investing in transparency and efficiency is
not without risks. Organisations which invest in
transparency can be sanctioned by means of a budget
reduction: control and transparency could reveal that
equal performance can be realised with fewer resources.
A similar organisation which does not invest in
transparency and efficiency is rewarded with the same
budget for equal performance.


A GENERAL EXPLANATION OF 
NEGATIVE EFFECTS AND GLOBAL 
STRATEGIES TO PREVENT THOSE 


According to Smith and De Bruijn, negative im- 
pact finds its origin in a mismatch between the 
objectives of principals/management on the one 
hand and those of the agents/professionals on the 
other. De Bruijn (2002, 2006) sees two general 
reasons behind the negative effects of performance 
measurement. 


In the first place, he states that professionals
could pervert the performance measurement systems
and that they consider themselves legitimised
to do so. This has several reasons. First of all,
they consider performance measurement to be poor
measurement because, certainly in the public sector,
there is a trade-off between several (competing)
values. Public performance is plural and this is not
always reflected in the measurement systems.
Moreover, many professionals consider performance
measurement unfair, because it does not take
sufficient account of the fact that performance is in
many cases the result of co-production. A third and
last legitimating ground is the opinion that
performance measurement is in itself mostly static,
whereas performance is dynamic in nature.

A second general reason is that the more
managers want to steer by performance measurement,
the less effective performance measurement will be.
De Bruijn (2006) talks about “the paradox of
increasing perverse effects”: the more the management
wants to influence the primary process using
performance measurement, the more negative effects
will occur. The rationale of this paradox is twofold.
In the first place, professionals will try to “protect”
themselves from the performance measurement.
Secondly, he states that the more tangible the
functioning of performance measurement is, the less
justice is done to the plural, co-productive and
dynamic character of the performance. Moreover,
De Bruijn (2002) says that this paradox is particularly
difficult. If professionals do not conform to the
measurement system and screen themselves off, this
can be an incentive for strategic behaviour, which
results in a performance measurement that is not
effective. If, on the other hand, they are willing to
conform, negative effects can still occur, for example
too strong an emphasis on criteria for success, as
a result of which measuring is also not effective.

Both Smith (1995) and De Bruijn (2002,
2006) reflect on strategies to counteract and
reduce as much as possible the negative impact
of performance measurement. An element in both
studies is the use of subjective indicators in the
form of satisfaction measurements. De Bruijn sees
this as an alternative appraisal system, besides
performance measurement. We do not agree on this
point, in the sense that those appraisals can also be
an inherent part of the performance measurement
systems. The perception of the functioning of a
service can exert a substantial influence on the
functioning of this service (Stipak, 1979). There
is little doubt that such subjective indicators
(e.g. “are you happy with the opening hours ...”)
are not the only truth.

Nevertheless, in our opinion such indicators
are crucial in the measurement of performance,
although prudence in using them is in order. Yet
subjective indicators have frequently been used in
policy evaluation based on the simple assumption
that such indicators are also good measures of the
quality of the service. Among other reasons, the lack
of knowledge or visibility of a service can
systematically bias the subjective evaluation of that
service (Trent et al., 1984). Those considerations
need attention before the information produced by
subjective indicators can be used in policy evaluation
(Anderson et al., 1984). Subjective indicators can
give several types of policy-relevant information to
policy makers. Whether these subjective indicators are
by themselves sufficient to assess the quality of the
service is, however, another question (Stipak, 1979).

There are a number of other strategies which
can improve performance measurement. According
to De Bruijn (2006), one can accept a diversity of
(even competing) definitions of products. Moreover,
the fact that target variables are mutually competitive
is in itself not a problem (Swanborn, 1999). Using a
variety of product definitions offers a number of
advantages: it can reduce conflicts, it offers a richer
picture of the achieved performance, it moderates
perverting behaviour and it can also be interesting
for management. The diversity of product definitions
can be favourable for the authority of the results. If,
for example, an organisation scores badly (or well) on
the basis of all product definitions, the conclusion
is based on a firm argument.

Another piece of advice from De Bruijn (2006) is the
prohibition on a monopoly of “semantics”. Performance
measurement can have a concealing meaning and
different meanings to different people. This problem
increases as the distance between the producer
and the recipient of the data and figures produced
by performance measurement grows. The larger
this distance, the more difficult the data are to
interpret and the higher their alleged hardness will
be. This prohibition on a monopoly of semantics can
be realised by making clear agreements between, for
example, the managers and the professionals.

Limiting the functions and the forums for which
performance measurement is used frequently
helps performance measurement. The more functions
and forums the measurement has, the higher
the chance of the paradox of increasing perverse
effects. Therefore, clear agreements are essential
for the success of performance measurement and
especially for avoiding negative effects. A similar
recommendation can be found in the work of Smith
(1995), according to whom negative effects can be
thwarted by involving employees at all levels in the
development and the implementation of performance
measurement systems. He pleads for a flexible use of
performance indicators, and against using them only
as a control mechanism.

De Bruijn (2006) is in favour of a strategic selection
of the products that will be visualised by the
measurement system. As such, one can opt for a
heavy or a light measurement. The selection of the
products is a strategic choice. It is mostly motivated
by the striving towards completeness, although this
frequently leads to an overload of information and
such measuring is not cost-effective, whereas
with an intelligent selection of a more limited
number of products one can exert influence on
the organisation as a whole. Furthermore, there
is a difference between the operationalization of
plural products (for example giving education)
and simple products (for example the number
of repeaters). The choice falls on simple products and
only a limited number of plural products. In a sense,
contrary to this, the work of Smith (1995) argues for
quantifying each objective. For Smith, a critical
attitude is needed towards the design of performance
measurement systems: we should ask ourselves what
can and cannot be measured.

Another strategy is the management of the
competing approaches of product and process.
This means that it is essential for performance
measurement not only to focus on the output, but
also to look at the processes or throughput. It is
important to give sufficient attention to the so-called
black box in which input is translated into output
(Pawson & Tilley, 1997). Moreover, the risk of
negative effects is higher if product indicators are
used (Brans et al., 2008). That is an extra reason
to incorporate process indicators in the evaluation.

The essence of the strategies suggested by De
Bruijn (2002) is a distant/reserved use, space and
trust on the one hand and agreeing on the rules of the
game on the other. Moreover, Smith (1995) gives a
number of suggestions which are especially relevant
when objectives are rather vague and the measurement
of the output is problematic. In that case, he
emphasises the use of subjective indicators, by
measuring the satisfaction of customers. In the same
context it is better to leave the interpretation of
performance indicators to experts and to conduct a
conscientious audit.


CONCLUSION 


Nobody disputes that in public policy, evaluation
using objective indicators has become
increasingly important. Within the framework of
the growing complexity of the policy environment
and in view of the need for obvious accountability,
policy makers strive for an evidence-based policy
that must guarantee (the impression of) a high
return on public investment.


How these criteria and goals of policy evaluation
in general, and of measuring performance in
particular, can be met, which functions one can
or must assign to these evaluations and at which
forums they can be discussed is a matter of
discussion. These discussions are related to the
criteria of good governance and of moral and
democratic legitimacy.

Given that policy evaluation questions whether
and how objectives are attained, taking into account
the means deployed, it seems logical that
the choices of objectives and resources prevail over,
or are external or prior to, the evaluation. Policy
evaluation can lead to policy learning and to the
strengthening of accountability processes and is
therefore by nature beneficial. However, the need
for objective, quantified evaluation entails dangers
and risks. Calculating policy outcomes can
influence the choice of policy objective
and therefore the formulation of the problem.
Evaluation can influence the policy process in
an unforeseen, improper manner. Objectives
(problem solutions) can be chosen in such a way
as to maximise the chances of a good evaluation.

The negative effect of performance measurement
originates to an important degree from the
tension between managers and professionals,
between those who deliver policy and those who
measure and evaluate this delivery. These
professionals could fear a loss of autonomy because of
policy evaluation (Swanborn, 1999) and therefore
try to influence the evaluation process. These
relations are important for the success of good
measurement and policy learning, but they are not the
only factor: as we have shown, the measurement in
itself can have negative effects.

These negative effects are important in the debate
about performance measurement and the way
we deal with data: what data should be gathered?
What can we do with it? How should we analyse
it? How can we publish or comment on the analysis?
Recalling the words of Ridgway, “the cure is
sometimes worse than the disease” (1956, p. 240),
we should bear in mind that evaluation is not
good by definition. Ridgway dealt with the question
of whether it was appropriate at all to use quantitative
measurement instruments in order to analyse and
evaluate performance in the public sector. Half
a century later, the usefulness of quantifying
performance within the public sector no longer seems
to be the main subject of discussion. But more than
50 years after his conclusion, we can agree that
the perverse impact of performance measurement
is insufficiently recognised. Fortunately, there are
possible strategies to reduce specific perverse
impacts. Taking care of perverse effects is therefore
an important task, and a difficult one.

Policy evaluation serves a noble objective in
which “good” governance is of the first and greatest
importance. In that respect, it is a means to an
end. However, since it has become big business
with many people making money out of it, since
governments can no longer do without evaluating
their performance, and since the political pressure to
evaluate is increasing, it has sometimes become
an end in itself. Who will evaluate the evaluation?
If we want to make public policy better and more
accountable, we should look, more than we have
done in the past, at the mechanisms that are used
to bring about these effects. There is probably no
such thing as a truly “objective” evaluation.

In this chapter, we have demonstrated that asking
the question “what data do we need?” in order
to start the analysis of that data, including data
mining, has a severe impact on the precise nature
of that data and, therefore, on the knowledge that
data mining can produce. Performance measurement
is inevitable, so we need data to analyse.
But looking for data, and trying to translate policy
into quantitative standards that can be measured,
could change the reality that is captured in
the data, because it influences policy makers and
their actions. Therefore, attention needs to be paid to
the precise way in which we measure (in this case
policy performance) and gather data. If we do not
take these influences into account, if we are not
aware of them, data mining will be applied to data
that is less representative of the reality about which
we would like to know more.

Once we have solid empirical data, and the
difficulties posed above have been dealt with, the
actual data mining can begin. However, data mining
too must be a means to an end. As with the use of all
data, (critical) human judgment remains a critical
factor (Mead, 2003). As noted by Siegel (cited in
Mead, 2003), making data gathering integral to an
organization’s daily operational fabric tends to be
far more difficult than designing and building the
system. Gathering qualitative data by and about
the public sector is an important and necessary
step towards data mining in the public sector.
However, we have to take into account the
specific characteristics of the public context (cf.
Kostoff & Geisler, 1999).
KEY TERMS AND DEFINITIONS


Efficiency: Indicates the relation between
the deployed resources (input) and the delivered
products or services (output). It is the quantity of
input necessary per unit of output (I/O).

Effectiveness: Concerns the cause-and-consequence
relation between output and outcome.
Did the policy have the intended effect (within the
postulated period)? To what extent are there desired
or undesirable side effects?

Evaluation: The systematic and objective
determination of the worth or merit of an object.

Input: The financial, human, and material 
resources required to implement an operation. 

Output: The products, capital goods and 
services which result from a development in- 
tervention; may also include changes resulting 
from the intervention which are relevant to the 
achievement of outcomes. 

Outcome: The likely or achieved short-term 
and medium-term effects of an intervention’s 
outputs. 

Performance: The degree to which an opera- 
tion or organisation (...) operates according to 
specific criteria/standards/guidelines or achieves 
results in accordance with stated goals or plans. 


Performance Measurement: A system for
assessing the performance of development
interventions against stated goals.

Productivity: The inverse of efficiency; it is
the quantity of output one can realise per unit
of input (O/I).

Throughput: The processes involved when
converting inputs into outputs within the
organisation.
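
The two ratios defined above can be illustrated with a small worked example. The sketch below is only an illustration: the function names and the figures (a budget of 200,000 units and 4,000 processed files) are hypothetical and do not come from the chapter.

```python
def efficiency(input_quantity: float, output_quantity: float) -> float:
    """Quantity of input needed per unit of output (I/O)."""
    if output_quantity == 0:
        raise ValueError("efficiency is undefined when output is zero")
    return input_quantity / output_quantity


def productivity(input_quantity: float, output_quantity: float) -> float:
    """Quantity of output realised per unit of input (O/I)."""
    if input_quantity == 0:
        raise ValueError("productivity is undefined when input is zero")
    return output_quantity / input_quantity


# Hypothetical service: 200,000 budget units spent to process 4,000 files.
budget = 200_000.0
files_processed = 4_000.0

cost_per_file = efficiency(budget, files_processed)           # input per file
files_per_budget_unit = productivity(budget, files_processed)  # files per unit

print(cost_per_file)
print(files_per_budget_unit)
```

Because productivity is the inverse of efficiency, the product of the two values is 1 for any non-zero input and output.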
For example, Zondergeld-Hamer (2007)
discusses the role of religious expectations.

For example, the increase in healthy life
expectancy (effect) can be related to the number
of vaccinations and the number of persons
reached by prevention campaigns (output), or
to the quantity of government funds (input).
The fact that the indicators increase in the same
direction and at the same speed does not mean
that life expectancy can be explained by
preventive policy alone. The progress in medicine
(to which governments contribute by means of
education and R&D) and the rise in living
standards (to which governments contribute by
means of economic stimuli), resulting in
healthier eating habits, better housing and
faster medical treatment, also have their influence.

More specifically: tunnel vision, suboptimization,
myopia, measure fixation, misrepresentation,
misinterpretation, gaming and ossification.
The interpretation of these concepts is
discussed further in the text.

The distinction between them can be made
by whether or not some kind of sanctioning is
linked to the appraisal on the basis of the
performance measurement systems.

It is attractive for the professionals, it gives
them space, and it is also easier for managers
to hold them to account on simple products.
Moreover, a focus on simple products can
have a positive impact on plural products.
Finally, the good behaviour of professionals
can be rewarded by a restricted selection of
products. For a more detailed argumentation
we refer to De Bruijn (2002, pp. 152-154).
ABSTRACT


In local government, financial analysis is focused on evaluating the financial condition of municipalities,
and this is normally accomplished via an analytic process examining four dimensions: sustainability
(or budgetary stability), solvency, flexibility and financial independence. Accordingly, the first goal the
authors set out to achieve in this chapter is to determine the principal explanatory factors for each of
the above dimensions. This is done by examining a wide range of ratios and indicators normally available
in published public accounts, with the aim of extracting the most significant explanatory variables
for sustainability, solvency, flexibility and financial independence. They use a rule induction algorithm
called CHAID, which provides a highly efficient data mining technique for segmentation, or tree growing.
The research sample includes 877 Spanish local authorities with a population of 1000 inhabitants
or more. The developed model presents a high degree of explanatory and predictive capacity. For the
levels of budgetary sustainability the most significant variables are those related to the current margin,
together with the importance of capital expenditure in the budgetary structure. On the other hand,
short-term solvency depends on the liquid funds possessed by the entity. Flexibility, however, depends
mainly on the financial load per inhabitant of the municipality and on the total sum of fixed charges. Finally,
financial independence depends fundamentally on the transfers that the entity receives and on the fiscal
pressure, among other elements.
INTRODUCTION


In the context of private enterprise, profitability
is the main variable analyzed and monitored by
researchers and company managers. At the
theoretical level, the DuPont Model establishes the
relationships between profitability and a group
of variables and accounting ratios such as asset
turnover, sales margin and financial leverage.
However, in the public sector, and even more
specifically in the area of local government, this
concept is overshadowed by other magnitudes that
reflect the success or otherwise of management
performed in the public interest. Interest is thus
focused on evaluating the financial condition of
municipalities, and this is normally accomplished
via an analytic process examining four dimensions:
sustainability (or budgetary stability), solvency,
flexibility and financial independence (Groves
et al., 2003).

Accordingly, the first goal we set out to achieve
in this chapter is to determine the principal explanatory
factors for each of the above dimensions. This
is done by examining a wide range of ratios and 
indicators normally available in published public 
accounts, with the aim of extracting the most sig- 
nificant explanatory variables for sustainability, 
solvency, flexibility and financial independence. 
We seek to quantify these relationships and their 
explanatory variables and thus obtain the relevant 
profiles, i.e., the combinations of economic- 
accounting features of the best municipalities with 
respect to their levels of sustainability, solvency, 
flexibility and financial independence (Zafra- 
Gomez et al., 2009a; 2009b; 2009c). 

In this context, drawing up a body of rules 
making it possible to determine the probability 
of an organization presenting a better or worse 
financial condition is a crucial issue. It matters
to public managers because they must be aware
of the variables to be controlled when seeking a
stable financial situation with regard to the four
elements being considered. Moreover, the utility
of this methodology is that it provides a control
instrument for municipal supervisory agencies 
(central or regional government) as those local 
authorities that face an Emergency Financial 
Condition would be obliged to draw up a viabil- 
ity plan to improve it, and their autonomy would 
be reduced by the supervision by such agencies. 
Taxes would have to be increased, within legal 
limits, and/or the services provided would have to 
be cut back, with the ensuing loss of popularity. 
Another real consequence that would affect the 
financial condition of such local authorities would 
be the denial to them of access to indebtedness 
facilities for investment projects. At the other 
extreme, those authorities presenting an excellent 
financial condition would be subjected to fewer 
controls and supervision, and thus their autonomy 
would increase; they would have greater access 
to certain forms of financial assistance and to 
indebtedness facilities. 

This analysis makes use of a rule induction 
algorithm called CHAID (Chi-squared Auto- 
matic Interaction Detector, Kass, 1980), which 
provides a highly efficient data mining technique 
for segmentation, or tree growing, so that a tree 
of rules may be derived to describe different 
segments within the data in relation to the output 
(dependent) variable, allowing us to classify local 
governments according to the different values of 
their accounting ratios (explanatory variables or 
predictors). 

The chapter begins with a review of the main 
empirical studies carried out to measure financial 
crises affecting local authorities. We go on to 
outline our methodological proposal to achieve 
the above aims, explaining the analytic technique 
to be applied, and then describe the sample and 
the variables. Subsequently, the main results of 
the analysis are discussed, firstly by means of an 
exploratory analysis, and then from an explana- 
tory viewpoint. Finally, we highlight the most 
important issues raised in this chapter and suggest 
future areas for investigation. 
HOW SHOULD FINANCIAL
CRISES IN LOCAL GOVERNMENT
BE MEASURED?


The study of municipal financial crises is a research
objective that, while not a novel one, remains topical
and continues to attract researchers around the
world. Today, projects to evaluate fiscal distress 
continue to be developed in various US States 
(Khola et al., 2005). In Australia, as observed by 
Dollery et al. (2006), various local authorities are 
experiencing severe or chronic fiscal stress, while 
in the UK, the Audit Commission published a pa- 
per in February 2007 in which it was commented 
that “the assessment focuses on the importance of 
having sound and strategic financial management 
to ensure that resources are available to support 
the council’s priorities and improve services”. 

Traditionally, the studies undertaken concern- 
ing the areas of budgetary, financial and economic 
information have tended to be based on the analysis 
of the financial situation of the organization in 
question. However, in recent years, such stud- 
ies have been addressed in the framework of a 
broader concept, known as the financial condition.
A growing volume of research in this respect is 
being carried out, with the aim of acquiring greater 
information about aspects that characterize the 
development of public activities. 

The principal objective of our study of financial 
condition is to determine a means of measuring 
the financial crises that affect municipalities. 
Traditionally, financial condition is taken to be 
the ability of a government to provide services 
and to meet its future obligations (GASB, 1987); 
it can be measured by considering the situation of 
its net assets, its budget balance or the net cash 
position (GASB, 1999). Thus, if the institution is 
capable of meeting its debts and, at the same time, 
providing acceptable levels of services, we may 
say that it is in good financial health. For Groves
et al. (2003), financial health results from various
elements, which can be measured by means of
four magnitudes that are related to cash solvency,
budget solvency, long-run solvency and service-level
solvency. Cash solvency is understood to be
the entity’s ability to generate sufficient liquidity 
to pay its short-term debts. Budget solvency is 
its ability to obtain sufficient budgetary income 
without entering into deficit. Long-run solvency 
concerns a government’s ability to respond ad- 
equately to all its long-term obligations, while 
service-level solvency is defined as expressing 
the entity’s capacity to provide the level and qual- 
ity of services necessary for the wellbeing of the 
community in question. These four concepts of 
solvency embrace what the above authors have 
termed the financial factor. 

The financial factor reflects the condition of 
the government’s internal finances. For other 
authors, this concept is focused on the study of 
its assets and of liabilities (which may be of im- 
mediate effect or could have to be met at some 
future time), together with an analysis of income 
and expenditure trends and of the particular factors 
that characterise institutions when they acquire 
financial liabilities, within a particular time span 
and a specific, clearly-bounded economic dimension
or space (Copeland & Ingram, 1983; Berne,
1992; Clark, 1990, 1994; Groves et al., 2003). 

For Greenberg and Hillier (1995) and CICA
(1997), the financial condition of an organization
can be measured by means of a series of indicators
related to its sustainability, flexibility, vulnerability 
and short-term solvency. 

Sustainability refers to an organization’s abil- 
ity to maintain, promote and protect the social 
welfare of the population, employing the re- 
sources at its disposal. Flexibility is understood 
as a body’s capability to respond to changes in 
the economy or in its financial circumstances, 
within the limits of its fiscal abilities, a capabil- 
ity that depends on the degree to which it is able 
to react to such changes, via modifications to tax 
rates, public debt or transfers. Finally, vulnerabil- 
ity is understood to be an organization’s level of 
dependence on external funding received via 
transfers and grants. 
Figure 1. Elements of financial condition (sustainability, flexibility, short-term solvency and vulnerability)


This is the context in which we put forward 
the following proposal for a financial perfor- 
mance model based on the concept of financial 
condition. To measure the financial factor, let us 
consider some of the definitions proposed by the 
above-mentioned authors. On the one hand, it is 
necessary to measure the concepts of short-run 
solvency (the capability to generate short-run 
liquidity) and of budget solvency (the capability 
to respond to budgetary obligations) (Groves et 
al., 2003). This concept is divided into the more 
specific aspects of flexibility, sustainability and 
vulnerability (Greenberg and Hillier, 1995; CICA, 
1997). Finally, we measure long-run solvency by 
studying whether, in the course of the financial 
years that are under study, the local authority 
officers have been capable of improving the au- 
thority’s financial condition. 

However, the problem of measuring the above 
lies in the fact that four different elements must be 
addressed, in each of which each local authority 
may present different levels. Thus in predicting 
financial condition we must obtain a valuation of 
each of the elements of which it is constituted. 

In view of these considerations, a study of the
financial condition of local authorities requires an
individualized analysis of each of its constituent
elements, as these components are very heterogeneous.
Therefore, we considered the application
of the CHAID methodology to each of the four
elements that constitute the financial condition:
one to examine an entity’s budgetary sustainability,
another for its short-term solvency, another
for its flexibility (or indebtedness) and, finally,
one for its levels of financial independence from
other entities.


METHODOLOGICAL 
PROPOSAL: CHAID 


The phenomenon to be studied requires us to
establish certain rules for the behaviour of the
different elements that constitute the financial
condition of a local authority, using a specific
data set. Such a study can be carried out, among
other means, using decision trees, which provide
a set of hierarchically organised rules, such that
the final decision is reached through a sequence
of logical decisions, from the tree trunk towards
the leaves.
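
The idea of a hierarchy of rules followed from the trunk to a leaf can be sketched in a few lines. The tree below is purely illustrative: the variable names, the thresholds and the leaf labels are hypothetical and are not results from this chapter. Each node tests one ratio and may have more than two branches.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Leaf:
    label: str  # final decision reached at this leaf


@dataclass
class Node:
    variable: str                      # ratio tested at this node
    branch_of: Callable[[float], str]  # maps a value to a branch name
    children: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)


def three_way(low: float, high: float) -> Callable[[float], str]:
    """Build a rule that sorts a value into 'low', 'medium' or 'high'."""
    return lambda v: "low" if v < low else ("high" if v > high else "medium")


def classify(tree: Union[Node, Leaf], record: Dict[str, float]) -> Tuple[str, List[str]]:
    """Walk from the trunk to a leaf; return the label and the rules applied."""
    path = []
    while isinstance(tree, Node):
        branch = tree.branch_of(record[tree.variable])
        path.append(f"{tree.variable} is {branch}")
        tree = tree.children[branch]
    return tree.label, path


# Hypothetical tree with invented thresholds and labels.
tree = Node("current_margin", three_way(0.0, 0.10), {
    "low": Leaf("weak sustainability"),
    "medium": Node("capital_expenditure_share", three_way(0.15, 0.30), {
        "low": Leaf("strong sustainability"),
        "medium": Leaf("moderate sustainability"),
        "high": Leaf("weak sustainability"),
    }),
    "high": Leaf("strong sustainability"),
})

label, path = classify(
    tree, {"current_margin": 0.05, "capital_expenditure_share": 0.35}
)
print(label)
print(" AND ".join(path))
```

Reading the returned path from left to right gives the rule, in plain language, that led to the final classification.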

In fact, a great many algorithms are capable of 
generating rules based on decision trees, includ- 
ing CLS (Hunt et al., 1966), ID3 (Quinlan, 1979), 
CART (Breiman et al., 1984) and C4.5 (Quinlan, 
1993). In the present chapter, we implemented
the algorithm known as CHAID (Chi-squared
Automatic Interaction Detector), which is simple
to apply and widely used. This classification
mechanism, originally proposed by Kass (1980),
has been used extensively by many authors in
different studies to derive a tree of rules which
helps in understanding many phenomena (Santin,
2006; Galguera, 2006; Grobler, 2002; Strambi,
1998). It was also extended to ordinal dependent
variables by Magidson (1993), who illustrated how
this extension could be used to take advantage of
fixed scores, such as profitability, for each category
of the dependent variable when such scores are
known, as well as how to estimate meaningful
scores when category scores are unknown.

As a segmentation tool, CHAID presents important benefits. Firstly, the technique is not based on any specific probability distribution, but solely on chi-squared goodness-of-fit tests computed from contingency tables. These, given an acceptable sample size, almost always function well. Secondly, it makes it possible to determine a variable to be maximized. This is very desirable, and not always possible with other segmentation techniques. Thirdly, classification by segments is always straightforward to interpret, as its results provide intuitive rules that are readily understood by non-experts, which is not the case, for example, with Cluster Analysis. Fourthly, this technique ensures that the segments always have statistical meaning; they are all different, and are the best possible, given the data provided. Accordingly, the classifications made using the rules found are mutually exclusive, and so the decision tree identifies a single response based on a calculation of the probabilities of belonging to a certain class.

Finally, CHAID, unlike other algorithms such as CART (Breiman et al., 1984), is capable of constructing non-binary trees, i.e. each node can present more than two branches, or data divisions, according to the categories to be explained. The algorithm performs non-symmetrical partitions that are optimal for each explanatory variable, and which are derived from contingency tables based on the chi-squared statistic. After a series of iterations, the algorithm establishes the point at which the structure created is optimum, by demanding a level of significance for each branch created.

CHAID provides a set of rules that can be
applied to a new (unclassified) dataset to predict 
which records will have a given outcome. Using 
the significance of a chi-squared test, CHAID 
evaluates all of the values of each potential 
explanatory variable, by merging those values 
that are judged to be statistically homogeneous 
(similar) with respect to the dependent variable 
(target) and maintaining all the others, which 
are heterogeneous (dissimilar). It then selects 
the best predictor to form the first branch in the 
decision tree, such that each leaf node is made up 
of a group of homogeneous values of the selected 
field. This process continues recursively until the 
tree is fully grown. 

Let us now examine in detail the methodological process to be followed to apply the technique. A complete description of this algorithm, with tutorial references, is given by Kass (1980), Biggs (1991) and Goodman (1979). Santin (2006) uses a simplified version of this application.


Binning of Continuous 
Explanatory Variables 


In the first step, continuous explanatory variables 
are automatically discretized or binned into a set 
of ordinal categories. This process is performed 
once for each continuous explanatory variable in 
the model. Discretization can be done through 
various machine learning algorithms for building 
decision trees or decision rules, in particular by the 
CHAID algorithm, which we apply. We are aware 
that there are several methods for binning into a 
set of categories, for example, the one proposed 
by Berka (1998), which will be studied in future 
research, to compare results with those described 
in this chapter.


Merging Categories for
Explanatory Variables 


All the explanatory variables are merged to 
combine categories that are not statistically dif- 
ferent with respect to the dependent variable, and 
each final category of an explanatory variable X 
represents a leaf node if that variable is used to 
split the node. For each explanatory variable X, 
the algorithm finds the pair of categories of X 
that is least significantly different (indicated by 
the largest p-value) with respect to the dependent 
variable Y. The method used to calculate the p-value is the chi-squared test:

$$X^2 = \sum_{j=1}^{J} \sum_{i=1}^{I} \frac{(n_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}}$$

where $n_{ij} = \sum_n f_n \, I(x_n = i \wedge y_n = j)$ is the observed cell frequency and $\hat{m}_{ij}$ is the expected (estimated) cell frequency for cell $(x_n = i, y_n = j)$ under the null hypothesis of independence. The corresponding p-value is given by $p = \Pr(\chi^2_d > X^2)$, where $\chi^2_d$ follows a chi-squared distribution with $d = (J - 1)(I - 1)$ degrees of freedom. The frequency associated with case $n$ is denoted by $f_n$.
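The test described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the chapter: the function names and the example table are hypothetical, all frequencies are taken as plain counts, and the p-value is obtained from a simple series expansion of the incomplete gamma function so that the sketch needs only the standard library.

```python
import math


def chi2_statistic(table):
    """Return (X2, d) for an I x J table of observed counts n_ij."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # m_ij under independence
            if expected > 0:
                x2 += (observed - expected) ** 2 / expected
    d = (len(table) - 1) * (len(table[0]) - 1)
    return x2, d


def chi2_sf(x, d):
    """Upper-tail probability P(chi2_d > x), via the series expansion of the
    regularized lower incomplete gamma function."""
    if x <= 0 or d <= 0:
        return 1.0
    a, z = d / 2.0, x / 2.0
    term = total = 1.0
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total
    return min(1.0, max(0.0, 1.0 - lower))


# Hypothetical 2 x 2 table: rows are categories of X, columns are categories of Y.
table = [[30, 10], [10, 30]]
x2, d = chi2_statistic(table)
p_value = chi2_sf(x2, d)
print(x2, d, p_value)
```

In practice a statistical library would normally supply the tail probability; the hand-written series is used here only to keep the example self-contained.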

Then, it merges into a compound category the pair that gives the largest p-value, and calculates the p-value based on the new set of categories of X. This represents one set of categories for X. The process is repeated until only two categories remain. Then, the sets of categories of X generated during each step of the merge sequence are compared, to find the one for which the p-value calculated in the previous step is the smallest. That set is the set of merged categories for X to be used in determining the split at the current node.
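The merge sequence just described can be sketched as follows. The sketch rests on several assumptions that are not part of the chapter: the predictor is treated as nominal (any two categories may be merged, whereas an ordinal predictor would only allow adjacent ones), no Bonferroni adjustment is applied, and the function names and counts are hypothetical.

```python
import math
from itertools import combinations


def chi2_p_value(rows):
    """p-value of the Pearson chi-squared test for a table given as a list of rows."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    x2 = sum(
        (obs - rt * ct / n) ** 2 / (rt * ct / n)
        for r, rt in zip(rows, row_totals)
        for obs, ct in zip(r, col_totals)
        if rt * ct > 0
    )
    d = (len(rows) - 1) * (len(rows[0]) - 1)
    if x2 <= 0 or d <= 0:
        return 1.0
    a, z = d / 2.0, x2 / 2.0
    term = total = 1.0
    k = 1
    while term > 1e-15 * total:  # series for the lower incomplete gamma function
        term *= z / (a + k)
        total += term
        k += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total
    return min(1.0, max(0.0, 1.0 - lower))


def merge_categories(counts):
    """counts maps each category of X to its counts over the categories of Y.
    Returns (p_value, groups) for the set of categories with the smallest p-value."""
    groups = {(cat,): list(row) for cat, row in counts.items()}
    candidates = [(chi2_p_value(list(groups.values())), dict(groups))]
    while len(groups) > 2:
        # The least significantly different pair has the largest p-value.
        a, b = max(
            combinations(groups, 2),
            key=lambda pair: chi2_p_value([groups[pair[0]], groups[pair[1]]]),
        )
        merged = [x + y for x, y in zip(groups.pop(a), groups.pop(b))]
        groups[a + b] = merged
        candidates.append((chi2_p_value(list(groups.values())), dict(groups)))
    return min(candidates, key=lambda item: item[0])


# Hypothetical counts of a binary Y within four categories of X.
counts = {"A": [20, 30], "B": [22, 28], "C": [35, 15], "D": [36, 14]}
best_p, best_groups = merge_categories(counts)
print(best_p, best_groups)
```

With these illustrative counts, A and B behave alike, as do C and D, so the two-category grouping is the one retained.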


Splitting Nodes 


Once the categories have been merged for all the explanatory variables, the algorithm selects the explanatory variable with the strongest association with the dependent variable, i.e. the one for which the chi-squared test has the smallest p-value. If this value is less than or equal to $\alpha_{\text{split}}$ (the split threshold), then that variable is used as the split variable for the current node. Each of the merged categories of the split variable defines a child node of the split. After the split is applied to the current node, the child nodes are examined to see if they warrant splitting by applying the merge/split process to each in turn. This process continues recursively until the tree is fully grown.
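The selection rule can be illustrated with a short sketch. It assumes that a p-value has already been computed for each explanatory variable after merging; the variable names, the p-values and the default threshold of 0.05 are hypothetical and are not taken from the chapter.

```python
def select_split(p_values, alpha_split=0.05):
    """Return the predictor with the smallest p-value if it does not exceed
    alpha_split; return None when no predictor qualifies (the node is not split)."""
    if not p_values:
        return None
    best = min(p_values, key=p_values.get)
    return best if p_values[best] <= alpha_split else None


# Hypothetical p-values obtained after merging the categories of each predictor.
p_values = {"gross_savings": 1e-6, "capital_funding": 0.003, "fiscal_pressure": 0.20}
chosen = select_split(p_values)
not_split = select_split({"fiscal_pressure": 0.20, "staff_costs": 0.35})
print(chosen, not_split)
```

Returning `None` marks the node as terminal, which is what stops the recursion on that branch.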


RESULTS OF THE MODEL


Support


The support for a scored record is the weighted number of records in the data in the scored record’s assigned terminal node (t), i.e., the number of records covered by each rule. $N_{f,j}(t)$ is the weighted number of records in node t with category j (or the number of records if no frequency or case weights are defined):

$$N_{f,j}(t) = \sum_{i \in t} f_i \, x_i(j)$$

and $N_{f,j}$ is the weighted number of records in category j (any node):

$$N_{f,j} = \sum_{i \in T} f_i \, x_i(j)$$

where $f_i$ is the weight of record i (equal to 1 if no weights are defined) and $x_i(j) = 1$ if record i is in category j, and 0 otherwise.


Response (Confidence)


The confidence for a scored record is the proportion of weighted records in the data in the scored record’s assigned terminal node (t) that belong to a selected category j, modified by the Laplace correction (Margineantu, 2001), with k being the number of categories:

$$\text{Response}(\%) = \frac{N_{f,j}(t) + 1}{N_f(t) + k}$$

where $N_{f,j}(t)$ is the weighted number of records in node t with category j and $N_f(t)$ is the weighted number of records in node t.

The level of confidence (%) of each rule (terminal node) shows the proportion of records of each rule that belong to a selected category j. The level of confidence of a set of rules can also be defined as the proportion of records of this rule set belonging to a given category j.
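As a small illustration of the Laplace-corrected confidence, the following sketch adds one to the count of records in the selected category and the number of categories k to the node total. The function name and the counts are hypothetical.

```python
def laplace_confidence(n_node_category, n_node, k):
    """Laplace-corrected share of the records in a terminal node that belong
    to the selected category, with k categories in the dependent variable."""
    return (n_node_category + 1) / (n_node + k)


# Hypothetical node: 90 of 100 records in the selected category, four categories.
conf = laplace_confidence(90, 100, 4)
print(conf)  # 91 / 104 = 0.875
```

The correction pulls the estimate slightly towards 1/k, which matters mostly for nodes with few records.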


Index 


The index of each of the rules obtained for a given category j is the ratio between the level of confidence of the rule (terminal node) and the level of confidence of category j in the total sample (i.e., 25%, as the sample is divided into quartiles).

It is therefore obtained by dividing the proportion of records that present category j in each terminal node (rule) by the proportion of records presenting category j in the total sample (25%). Thus, it represents the increase in the probability of belonging to the selected category j for the records presenting the characteristics defined by each rule. By accumulation, the index of a set of rules can be obtained as the ratio between the proportion of records presenting category j in this rule set and the corresponding proportion found within the total sample (25%).
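As a worked illustration of this ratio, take a rule with a confidence of 0.994 for its category (the value reported later in the chapter for Rule 2 of BR=1) and the 25% base rate implied by the quartile split:

```latex
\[
\text{Index}
  = \frac{\text{confidence of the rule}}{\text{confidence of category } j \text{ in the total sample}}
  = \frac{0.994}{0.25}
  = 3.976
\]
```

Under these figures, a record matching the rule would be roughly four times as likely to fall in the category as a record drawn at random from the sample.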


Gain 


The gain for each terminal node (rule) can be defined, in absolute terms, as the number of records in a selected category j. For a set of rules or terminal nodes, and in percentage terms, the gain summary provides descriptive statistics for the terminal nodes of a tree, and shows the weighted percentage of records in a selected category j:

$$\text{Gain}(\%) = g(t, j) = \frac{\sum_{i \in t} f_i \, x_i(j)}{\sum_{i \in t} f_i}$$

where $x_i(j) = 1$ if record $x_i$ is in category j, and 0 otherwise.


Risk Estimates 


Risk estimates describe the risk of error in predicted values for specific nodes of the tree and for the tree as a whole. The risk estimate r(t) of a node t is computed as:

$$r(t) = \frac{1}{N_f} \sum_{j \neq j^*(t)} N_{f,j}(t)$$

where $N_{f,j}(t)$ is the sum of the frequency weights for records in node t in category j (or the number of records if no frequency weights are defined), $j^*(t)$ is the category predicted for node t, and $N_f$ is the sum of frequency weights for all records in the sample.

The risk estimate R(T) for the tree (T) is calculated by taking the sum of the risk estimates for the terminal nodes r(t):

$$R(T) = \sum_{t \in T'} r(t)$$

where $T'$ is the set of terminal nodes in the tree.
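A minimal sketch of these risk estimates is given below. It assumes that each terminal node predicts its most frequent category, so that the records outside that category are the ones counted as errors; this assumption, the function names and the counts are illustrative and do not come from the chapter.

```python
def node_risk(class_counts, n_total):
    """Share of the whole sample that is misclassified in one terminal node,
    assuming the node predicts its most frequent category."""
    return (sum(class_counts) - max(class_counts)) / n_total


def tree_risk(terminal_nodes):
    """Sum of the node risks over all terminal nodes of the tree."""
    n_total = sum(sum(counts) for counts in terminal_nodes)
    return sum(node_risk(counts, n_total) for counts in terminal_nodes)


# Hypothetical tree with two terminal nodes and a two-category dependent variable.
nodes = [[40, 10], [5, 45]]
risk = tree_risk(nodes)
print(risk)  # (10 + 5) / 100 = 0.15
```

Because every node risk is divided by the size of the whole sample, the tree risk is simply the overall misclassification rate.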
ANALYZING THE FINANCIAL
CONDITION OF LOCAL 
AUTHORITIES IN SPAIN 
USING CHAID 


Sample 


The database contains economic, financial and budgetary data from 877 local authorities with a population of 1000 inhabitants or more, for the period 1992-1999 (Source: Directorate General for Financial Coordination with Local Authorities, Spanish Ministry of Economy and Finance). The data were provided by this Directorate and are available from it on request.

The final sample consisted of 6200 cases, once the appropriate filtering had been performed to exclude cases for which we did not possess all the values of the variables defined. In addition, taking into account the considerable standard deviation of many of the variables and their nature as ratio variables, as well as the assumption that the data may include mistakes due to the lack of precision that can be assumed to be inherent in the accounting of small corporations, it was decided to remove the extreme cases, these being defined as those in which the standard deviation for each variable was exceeded by five times or more. It should be noted that some of these extreme cases were discarded because they lacked economic sense: they corresponded to municipalities with fewer than 1,500 inhabitants, which follow practically no strict accounting or budgetary procedures and so do not help us to identify variables that warn of financial crisis.
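The filtering of extreme cases can be sketched as follows. The chapter does not spell out how the five-standard-deviation criterion was applied, so this sketch assumes that a case is extreme when any of its variables lies five or more sample standard deviations away from that variable's mean; the function name, field name and data are hypothetical.

```python
import statistics


def remove_extreme_cases(records, variables, threshold=5.0):
    """Drop records in which any variable lies `threshold` or more sample
    standard deviations away from that variable's mean (an assumed reading)."""
    limits = {}
    for var in variables:
        values = [rec[var] for rec in records]
        limits[var] = (statistics.mean(values), statistics.stdev(values))

    def is_extreme(rec):
        return any(
            sd > 0 and abs(rec[var] - mean) >= threshold * sd
            for var, (mean, sd) in limits.items()
        )

    return [rec for rec in records if not is_extreme(rec)]


# Fifty plausible ratios plus one implausible value.
records = [{"gross_savings": 0.10 + 0.001 * i} for i in range(50)]
records.append({"gross_savings": 250.0})
kept = remove_extreme_cases(records, ["gross_savings"])
print(len(records), len(kept))
```

A single pass is used here; since an extreme value inflates the standard deviation itself, a filter of this kind only catches outliers when the sample is reasonably large.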


Variables 


The aim of this research was to identify the most 
significant explanatory variables of sustainability, 
solvency, flexibility and financial independence. 
Accordingly, four models were created, with 
each of these parameters, in turn, being taken as 
the dependent variable. As possible explanatory variables, we used a set of ratios and indicators commonly applied by local authority managers and analysts. The definitions of the variables used are shown in Appendix 1, with emphasis on those whose behaviour we set out to explain in the models created.


Dependent Variables Categorized 


Each of the dependent variables to be studied is 
categorized into quartiles, in order to obtain an 
indication of the presence of the local authori- 
ties in each of these quartiles, which define four 
budgetary situations (numbered 1 to 4, in rising 
order according to the values of the dependent 
variable). This categorization into quartiles is 
applied by many authors in studies which use the 
CHAID technique such as Santin (2006), Dills 
(2005) and Gonzalez et al. (2002). In our chapter, 
what is especially interesting is the focus on the 
first and fourth quartiles, which represent the best 
and the worst budgetary situations (success and 
failure profiles). 
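The categorization into quartiles can be sketched with the Python standard library. The cut points are computed here with the default method of `statistics.quantiles`; the chapter does not state which convention was used for the cut points or for ties, so both are assumptions, and the values are hypothetical.

```python
import bisect
import statistics


def to_quartiles(values):
    """Map each value to a quartile label from 1 (lowest) to 4 (highest)."""
    cuts = statistics.quantiles(values, n=4)  # three cut points
    return [bisect.bisect_left(cuts, v) + 1 for v in values]


# Hypothetical values of a dependent variable for eight local authorities.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = to_quartiles(values)
print(labels)  # [1, 1, 2, 2, 3, 3, 4, 4]
```

A value that coincides exactly with a cut point is placed in the lower quartile by this sketch.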


Descriptive Analysis 


Table 1 shows the descriptive analysis performed 
for the whole set of variables employed in this 
analysis. 


ANALYSIS AND RESULTS


Exploratory Analysis


Firstly, we perform an exploratory analysis of the 
variables considered to be dependent, and their 
relation with the other variables, for each of the 
elements constituting the financial condition. 
The first aspect to be analyzed is that of bud- 
getary sustainability. For this purpose, we use 
the variable Index of non financial budget result 
(BR), as defined above and which describes the 
entity’s capacity or need for self-funding of its

Table 1. Descriptive analysis of the variables employed in the chapter


STANDARD 
DEVIATION 


VARIABLE MEDIAN 

I. Implementation of expenditure 0.13 0.85 
I. Implementation of current expenditure 0.06 0.93 
I. Implementation of capital expenditure 0.29 0.7
I. Implementation of receipts 0.86 0.14 0.87 
I. Implementation of current receipts 0.99 0.12 0.99 

I. Implementation of capital receipts 182.49 
Index of Current budget payments 0.92 0.06 0.93 
Index of Current budget receipts 0.84 0.12 0.86 

I. Public expenditure per capita 82145.83 


I. Current expenditure per capita 53508.89 
I. Significance of current costs 
0.92 
0.77 
0.43 
548769392.99 
0.41 
0.17 
23919.09 
0.29 
0.62 


Index of Expenditure rigidity 
I. Significance of current receipts 
Index of Taxation receipts 
Index of Fiscal pressure 
I. Receipts from current transfers 
Index of Gross savings 
I. Capital expenditure per capita 
I. Significance of capital expenditure 
Index of Capital funding 
Index of Net savings 
0.08 
7367.07 


I. Significance of financial load 




I. Financial load per capita 


I. Weight of financial load 


Current self-funding margin 


34173767 .26 


Outstanding debt of governing political party 


51083.62 46848.49 


0.14 0.67 


0.14 


1318568835.10 




111396668.00 


24111.47


17525.88 


13112.46
4198.30
I. Significance of cash surplus
Index of Immediate liquidity 0.39
Index of Expenditure results 0.2 0.14
Mean receipts lapse 214.97 
Mean payment lapse 170.69 442.61 112.21 


200323652.57 -600000.00 
0.1 


Index of available cash 0.01 
Index of staff costs 0.44 
Index of current transfers effected 0.08 


23862.70 

21949.84 
2842.44 

64955.36 


I. Public expenditure per capita 


I. Goods and services expenses per capita 


I. Financial expenses per capita 


I. Current receipts per capita 


24476.24 
19416.06 18501.73 
5076.91 1628.92 


20839.90 


55594.33 


58964.43 




Index of non financial budget result 


budget (non-financial surplus or deficit), without 
needing to resort to indebtedness. The second 
element is that of solvency, which is measured 
using the variable short-term solvency (STS), as 
defined above. This variable expresses the entity’s 
capacity to meet its short-term obligations from 
the liquid funds at its disposal and from receipts 
pending. The third aspect to be taken into ac- 
count is that of flexibility, which measures the 
levels of local authorities’ indebtedness. For this 
purpose, we use the variable index of the weight 
of financial load (FL), which reflects the local 
authority’s capacity to meet the obligations of its 
debts, i.e. capital repayment plus interest, from
its current receipts. Finally, we measure the level 
of financial independence of local authorities, 
which determines whether the entity employs 
more or less of its own resources (via taxation). 
This aspect is studied by means of the variable 
index of taxation receipts (TR). In all of the above 
cases, the variables are categorized by quartiles, 
such that the variable takes values from 1 to 4. 


Exploratory Analysis of Correlations; the Most
Important Explanatory Variables


The exploratory analysis of correlations between each of the explanatory variables and the dependent variable is shown in Tables 2, 3, 4 and 5, including the mean values for each of the categories of the target variable and the F-test statistic of independence. The tables include only the main variables, i.e. those presenting the greatest explanatory power (F-test) in relation to the dependent variable being studied.

Table 2 shows that the variables with the greatest explanatory capacity for the Index of non financial budget result are the index of available cash, the indices of gross and net savings, the index of capital funding and the current self-funding margin. By contrast, other variables, such as the index of implementation of capital expenditure, the index of current budget receipts and the index of coverage of the financial load, contribute no explanatory capacity.

Be that as it may, there are many variables 
that enable us to identify differences regarding 
the output variable, and a model is needed to help 
summarize the body of information available and 
organize it so that it may be useful for explanatory 
and predictive purposes. 

For Short-term solvency, the most influential 
variables are the index of gross savings, the net 
savings index and the margin of current self- 
funding, while on the contrary, little information 
is provided by the index of current transfers and


Table 2. Analysis of means and correlations of the variables with BR


Mean Mean Mean Mean
(BR 1) (BR 2) (BR 3) (BR 4) F value p-level


INDEPENDENT VARIABLES 


Index of available cash 


Index of Gross savings 


Index of Capital funding 


Index of Net savings 0.074 0.114 0.177 198.025 0.000 
Current self-funding margin | 096 | 09% 0.886 0.823 198.025 0.000 
I. Significance of capital expenditure 0.349 0.278 0.234 186.141 0.000 


I. Implementation of capital receipts 
Poa | os | oms | ons | usor | ow _| 


I. Significance of current costs 


I. Implementation of receipts 


I. Implementation of expenditure 


Table 3. Analysis of means and correlations of the variables with STS 


Mean Mean Mean Mean 
INDEPENDENT VARIABLES (STS 1) (STS 2) (STS 3) (STS 4) 


ne a E 
0.027 0.087 0.121 0.171 209.644 0.000 


o A [voor] EAE | oom | 


I. Immediate liquidity 0.009 0.351 0.624 149.14 0.000 
Index of coverage of expenditure 1.01 0.988 0.962 136.598 0.000 


I. Significance of financial load f 


the index of implementation, among others (see Table 3).

The exploratory analysis reveals that levels of indebtedness (see Table 4), measured by the variable weight of financial load, are related to the variables index of fixed charges, index of net savings and the margin of current self-funding, among others. Most of the variables that do not present any significant relation with levels of indebtedness are those concerning the entity’s cash balances.

From the prior analysis of the correlations between the variables to be studied and the other financial variables, we conclude that the variables that are most strongly related to the index of taxation receipts are the index of current transfer


Table 4. Analysis of means and correlations of the variables with FL


komsi | o | as | 01m | ons | 16059 | oom 


Table 5. Analysis of the means and correlations of the variables with TR 


INDEPENDENT VARIABLES | Mean (TR 1) | Mean (TR 2) | Mean (TR 3) | Mean (TR 4) | F test | p-level
I. Receipts from current transfers 
I. Significance of current receipts oe s a T e 0.000 


erate of current expen- | 0.567 0.692 0.737 546.56 0.000 


1. Significance of capital expenditure 0.254 0.000 


= Goods and services: expenses | -tiegs3.073 1110130939. | 21e36683 |80178654 149.447 0.000 
per capita 


I. Current receipts per capita 49095.334 56725.752 65284.625 88715.71 140.204 0.000 
Sum of current receipts 195821007 560565257.2 | 1238412657 1502112779 138.721 0.000 


I. Implementation of capital ex- 0.781 0.663 0.585 132.056 0.000 
penditure 


receipts, the index of self funding (both current and total), and the index of the significance of current receipts. Again, we find there is least relation, in this respect, with the variables related to the entity’s level of short-term solvency.

In short, it can be seen that the variables most closely related to the indicators used to measure each of the elements of financial condition are the index of gross savings, the index of net savings and the index of implementation of expenditure.


PREDICTIVE ANALYSIS WITH
CHAID: SUCCESS AND FAILURE 
PROFILES FROM EACH ELEMENT 
OF FINANCIAL CONDITION 


Results Obtained with the Rules for 
Each Element of Financial Condition 


With CHAID modelling, the sample is segmented using a classification tree to create a set of terminal nodes, with routes from the origin node (the whole sample) that constitute the profiles or rules for each of the categories defined in the variable to be explained. Moreover, it should be borne in mind that the main aim of the chapter is not to classify all cases that may arise in the environment, but to offer recommendations to local authorities regarding the main variables that influence the financial condition, as well as to ascertain the most suitable values for them.

In the case of the study of budgetary sustainability, we are particularly interested in the profiles of the local authorities located in the extreme quartiles, i.e. in BR=4 (high level of sustainability) and in BR=1 (low sustainability). Within each of these categories, we focus on the most important profiles, which are those presenting the highest classificatory and predictive capacities in terms of the level of confidence.

Accordingly, on the basis of the general rule tree, we extract the rules obtained for the categories BR=1 and BR=4 and, after ordering them by level of confidence and gain, select the most important rules in each category. The end result is that we have the rules for the highest sampling decile in each category, representing 620 local authorities. The principal rules selected, in both cases, are illustrated in Figure 2, which shows the corresponding support and confidence levels.

Thus, the main explanatory variables coincide with those identified previously in the exploratory analysis of correlations and, therefore, these are the variables that must be controlled by the local authorities if sustainability is to be improved.

The different rules for BR=4 indicate the levels within which these variables should be situated in order to ensure budgetary sustainability with a high level of probability. Thus, for example, Rule 13 indicates that when the index of gross savings is higher than 0.334 and the index of the significance of capital expenditure is no higher than 0.272, 100% (confidence) of the 118 (support) local authorities in the sample present very good levels of sustainability (upper quartile, BR=4).

At the other extreme, we have the profiles 
of the authorities with the lowest levels of sus- 
tainability. For example, Rule 2, for a sampling 
support of 155 authorities, indicates that there is 
a 99.4% probability that the authorities with a 
gross savings index of less than 0.018, an index 
of capital funding of less than 0.747 and an index 
of the significance of capital expenditure of over 
0.162 will present very low levels of budgetary 
sustainability (lower quartile, BR=1). 
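The two rules discussed above can be written as simple predicates and applied to new records. The thresholds, supports and confidences are those reported for Rule 13 (BR=4) and Rule 2 (BR=1); the field names, the data structure and the example records are hypothetical and serve only as an illustration.

```python
RULES = [
    # (label, predicted quartile of BR, support, confidence, condition)
    ("BR=4, Rule 13", 4, 118, 1.0,
     lambda r: r["gross_savings"] > 0.334
     and r["capital_expenditure_significance"] <= 0.272),
    ("BR=1, Rule 2", 1, 155, 0.994,
     lambda r: r["gross_savings"] <= 0.018
     and r["capital_funding"] <= 0.747
     and r["capital_expenditure_significance"] > 0.162),
]


def apply_rules(record):
    """Return (label, quartile, confidence) for the first matching rule,
    or None when the record is not covered by any of the listed rules."""
    for label, quartile, _support, confidence, condition in RULES:
        if condition(record):
            return label, quartile, confidence
    return None


# Hypothetical local authorities.
strong = {"gross_savings": 0.40, "capital_funding": 0.90,
          "capital_expenditure_significance": 0.20}
weak = {"gross_savings": 0.01, "capital_funding": 0.50,
        "capital_expenditure_significance": 0.30}
middle = {"gross_savings": 0.10, "capital_funding": 0.90,
          "capital_expenditure_significance": 0.20}
print(apply_rules(strong), apply_rules(weak), apply_rules(middle))
```

Only two of the selected rules are encoded, so a record that matches neither is reported as uncovered rather than assigned to an intermediate quartile.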

It would be possible to analyze all the selected rules in the same way, and thus obtain a series of profiles and/or recommendations providing local authorities with quantitative control measures for obtaining high levels of budgetary sustainability.

For the second element of financial condition considered, that of short-term solvency, the main rules obtained are shown in Figure 3. These rules show that the worst situations with respect to short-term solvency occur when liquid funds are scarce, when payments are made promptly and when more expenditure is implemented. On the other hand, for local authorities to have good levels of short-term solvency, they need a high index of liquidity, to receive payments promptly (in approximately three months), to make their payments in about six months, to achieve substantial budgetary receipts and, in most cases, to have a moderate level of fixed expenses.

The rules that most precisely determine when a local authority may have greater or lesser flexibility are shown in Figure 4. We find that the determinant variable, with respect to the rules determining the highest and lowest levels of indebtedness, is that of financial expenses per capita. It can also be seen that in order to obtain the best results (FL=1), rules with fewer variables are needed, and so they present a lower degree of complexity than does the analysis of local authorities with lower levels of flexibility (FL=4). In addition to this variable, the authorities presenting worse levels of flexibility are also characterized by higher fixed charges, low net savings and low current receipts per capita. In consequence, such local authorities must seek resources via indebtedness. Those presenting the highest values for flexibility are notable for presenting high levels of current receipts per capita and moderate levels of accumulated debt.


Figure 2. Principal rules obtained for BR=4 and BR=1


RULES FOR BR=4

Rule 1 (108; 0.963)
If I. gross savings > 0.070 and <= 0.135 and I. capital funding > 1.046

Rule 2 (66; 0.955)
If I. gross savings > 0.135 and <= 0.190 and I. significance of capital expenditure <= 0.117 and I. capital expenditure per capita <= 5763.515

Rule 4 (95; 0.905)
If I. gross savings > 0.190 and <= 0.225 and I. significance of capital expenditure <= 0.162

Rule 5 (158; 1.0)
If I. gross savings > 0.225 and <= 0.334 and I. significance of capital expenditure <= 0.162

Rule 7 (80; 1.0)
If I. gross savings > 0.225 and <= 0.334 and I. significance of capital expenditure > 0.162 and <= 0.235 and I. capital funding > 0.446

Rule 13 (118; 1.0)
If I. gross savings > 0.334 and I. significance of capital expenditure <= 0.272


RULES FOR BR=1

Rule 2 (155; 0.994)
If I. gross savings <= 0.018 and I. capital funding <= 0.747 and I. significance of capital expenditure > 0.162

Rule 4 (113; 0.85)
If I. gross savings > 0.018 and <= 0.070 and I. capital funding <= 0.351

Rule 5 (133; 0.857)
If I. gross savings > 0.018 and <= 0.070 and I. capital funding > 0.351 and <= 0.747 and I. significance of capital expenditure > 0.200

Rule 7 (130; 0.992)
If I. gross savings > 0.070 and <= 0.135 and <= 0.642 and I. significance of capital expenditure > 0.235 and I. capital funding <= 0.45

Rule 11 (97; 0.979)
If I. gross savings > 0.135 and <= 0.190 and I. significance of capital expenditure > 0.359 and I. capital funding <= 0.544


Figure 3. Principal rules obtained for STS=1 and STS=4


RULES FOR STS=1

Rule 3 (89; 0.764)
If I. immediate liquidity <= 0.049 and I. income results > 0 and Mean payment lapse > 57.403 and <= 160.353 and I. Implementation of expenditure > 0.878

Rule 4 (72; 0.917)
If I. immediate liquidity <= 0.049 and I. income results > 0 and Mean payment lapse > 160.353 and <= 262.65 and Mean receipts lapse <= 186.209

Rule 6 (130; 0.869)
If I. immediate liquidity <= 0.049 and I. income results > 0 and Mean payment lapse > 262.650

Rule 7 (69; 0.913)
If I. immediate liquidity > 0.049 and <= 0.128 and Mean payment lapse > 93.692 and <= 196.967 and Mean receipts lapse <= 112.1

Rule 9 (79; 0.97)
If I. immediate liquidity > 0.049 and <= 0.128 and Mean payment lapse > 196.967 and Mean receipts lapse <= 247.690 and I. significance of current receipts <= 0.790

Rule 10 (71; 0.803)
If I. immediate liquidity > 0.049 and <= 0.128 and Mean payment lapse > 196.967 and Mean receipts lapse <= 247.690 and I. significance of current receipts > 0.790

Rule 14 (120; 0.808)
If I. immediate liquidity > 0.128 and <= 0.203 and Mean payment lapse > 196.967 and Mean receipts lapse <= 247.690

Rule 15 (74; 0.811)
If I. immediate liquidity > 0.203 and <= 0.285 and Index of coverage of expenditure > 0.976 and Mean payment lapse > 112.202 and Mean receipts lapse <= 130.786


RULES FOR STS=4

Rule 6 (134; 1.0)
If I. immediate liquidity > 1.125 and <= 1.909 and Mean receipts lapse > 93.862 and Mean payment lapse <= 112.202

Rule 9 (345; 1.0)
If Index of immediate liquidity > 1.909 and Index of income results > 0 and Index of fixed charges <= 0.427

Rule 10 (85; 0.96)
If Index of immediate liquidity > 1.909 and Index of income results > 0 and Index of fixed charges > 0.427 and <= 0.492

Rule 11 (71; 1.0)
If Index of immediate liquidity > 1.909 and Index of income results > 0 and Index of fixed charges > 0.492


Figure 4. Principal rules obtained for FL=1 and FL=4


RULES FOR FL=1

Rule 3 (367; 0.951)
If I. weight of GF <= 121.894 and I. income results > 0

Rule 6 (231; 0.944)
If I. weight of GF > 121.894 and <= 419.274 and I. Current receipts per capita > 28593.622 and Outstanding debt of governing political party > -1.742.320 and <= 533.754

Rule 7 (103; 0.806)
If I. weight of GF > 121.894 and <= 419.274 and I. Current receipts per capita > 28593.622 and Outstanding debt of governing political party > 533.754


RULES FOR FL=4

Rule 4 (128; 0.781)
If I. Financial expenses per capita > 3087.972 and <= 4197.914 and I. Current receipts per capita <= 49704.129

Rule 7 (91; 0.956)
If I. Financial expenses per capita > 4197.914 and <= 6113.369 and I. Current receipts per capita <= 62098.372 and I. fixed charges > 0.669

Rule 8 (68; 0.853)
If I. Financial expenses per capita > 4197.914 and <= 6113.369 and I. Current receipts per capita > 62098.372 and <= 100892.467 and I. net savings <= 0.004

Rule 10 (66; 0.924)
If I. Financial expenses per capita > 6113.369 and <= 26328.526 and I. fixed charges > 0.427 and <= 0.530

Rule 13 (215; 0.991)
If I. Financial expenses per capita > 6113.369 and I. fixed charges > 0.530 and I. income results > 0 and I. Implementation of current receipts <= 1.012

Rule 14 (74; 0.905)
If I. Financial expenses per capita > 6113.369 and I. fixed charges > 0.530 and I. income results > 0 and I. Implementation of current receipts > 1.012

Finally, we show the rules for the element of financial independence (see Figure 5). From the analysis of the worst results, we conclude that in most cases these local authorities are characterized by high levels of receipts from current transfers, together with relatively low values for the variable significance of current receipts and, in some cases, low levels of fiscal pressure.

The local authorities presenting the best results, 
with respect to financial independence, are char- 
acterized by low levels of receipts from current 
transfers, significant current receipts and rela- 
tively high indices of fiscal pressure. In addition 
to these variables, in some rules there is a low 
index of capital funding and moderate financial 
expenditure per capita. 

It can be said that the greater or lesser
independence of local authorities largely
depends on the transfers received from other entities
and on the importance attained by the sum of
current receipts. In conclusion, the variables that
seem to predict a better or worse financial condi- 
tion are the index of gross savings, the index of 
significance of capital expenditure and the index 
of fixed charges. These are the main variables that 
public sector managers should monitor in order 
to ensure good financial condition. 
Figure 5. Principal rules obtained for TR=1 and TR=4

RULES FOR TR=1

Rule 2 (73; 1.0)
If I. receipts from current transfers > 0.440 and <= 0.481 and I. significance of current receipts <= 0.573

Rule 5 (222; 0.995)
If I. receipts from current transfers > 0.533 and <= 0.603 and I. significance of current receipts <= 0.706

Rule 7 (408; 1.0)
If I. receipts from current transfers > 0.603 and I. significance of current receipts <= 0.861 and I. fiscal pressure <= 77,070,262

RULES FOR TR=4

Rule 3 (181; 0.95)
If I. receipts from current transfers <= 0.230 and I. significance of current receipts > 0.750 and I. fiscal pressure > 111,391,233 and <= 576,599,338

Rule 4 (222; 1.0)
If I. receipts from current transfers <= 0.230 and I. significance of current receipts > 0.750 and I. fiscal pressure > 576,599,338

Rule 6 (99; 0.949)
If I. receipts from current transfers > 0.230 and <= 0.289 and I. significance of current receipts > 0.750 and I. weight of GF > 780.623 and <= 2206.967

Rule 10 (133; 0.992)
If I. receipts from current transfers > 0.289 and <= 0.329 and I. fiscal pressure > 77,070,262 and I. capital funding <= 0.544 and I. significance of current receipts > 0.861


Goodness of the Element
of Financial Condition


To illustrate the goodness of the rules for the
evaluated element, the following misclassification
matrix shows the cases correctly and incorrectly
classified by the general model (see Table 6).

The total risk, that is, the sum of all the risks
from all the terminal nodes (rules), is 34.08%.
This is the percentage of cases classified
incorrectly when all the model rules are used for
classification or prediction, and it also determines
the overall level of confidence provided by the
model (65.92%). The error rate is much lower than
the initial 75% found without sample segmentation
(the 75% represents the proportion of cases that do
not belong to a specific selected category).
Therefore, the rule model does contribute
explanatory and predictive capacity.

Table 6. Estimation of risks with the total for BR

Correct 4087 (65.92%)
Incorrect 2113 (34.08%)
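The risk figures reported in Table 6 follow from simple counts, and the short Python sketch below reproduces them. It is only an illustration: the function name `risk_estimate` is hypothetical, and the 75% baseline is taken from the text (the share of cases outside any one quartile category).

```python
def risk_estimate(correct, incorrect):
    """Return sample size, accuracy and risk (misclassification rate)."""
    total = correct + incorrect
    return {
        "total": total,
        "accuracy": correct / total,
        "risk": incorrect / total,
    }

# Counts reported in Table 6 for the BR model.
table6 = risk_estimate(correct=4087, incorrect=2113)

# Baseline risk without segmentation, as stated in the text.
baseline_risk = 0.75
risk_reduction = baseline_risk - table6["risk"]

print(f"cases: {table6['total']}")
print(f"risk: {table6['risk']:.2%}, accuracy: {table6['accuracy']:.2%}")
print(f"reduction versus baseline: {risk_reduction:.2%}")
```

With these counts the sample comprises 6200 cases, the risk is 34.08% and its complement, the accuracy, is 65.92%.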

However, as our interest mainly lies in the rules
for BR=4 and BR=1, and in particular in those
described above within each of these categories,
the error rate is reduced considerably if we make
our prediction using these rules exclusively.
Thus, Table 7 shows that for the six rules selected
for BR=4, with a sampling support of 620 local
authorities, the probability of an accurate
prediction increases to 97.5% (confidence or
response), which is equivalent to an index of
389.98%, i.e. almost four times higher than the 25%
of the total sample (the percentage of local
authorities with BR=4 in the unsegmented sample).
In other words, 620 local authorities presented the
above-stated levels of variables for the six rules
of BR=4, and of these authorities, 97.5% achieved
high levels of sustainability (BR=4). This also
means that the model presented accounts for 39%
(gain) of all the local authorities in the total
sample with a sustainability of BR=4. The level of
confidence in the six rules defined for BR=4 is
thus 97.5%, while the individual level of
confidence for each of these rules is as shown in
Figure 2.

Table 7. Gain, index and response with the rules selected for BR=4 and BR=1

[Table body not recoverable from the source. Surviving entries: index values 389.98 and 380.61; row label "6 Rules Decile BR=1".]
[Residue of a misclassification matrix for STS also survives here: header "Real/Predicted, STS=1, STS=2, STS=3"; count 1141.]
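The relationship between response, index and gain described above can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `rule_set_metrics` is hypothetical, and the total sample size of 6200 is not stated in this paragraph but inferred from the counts in the risk tables (for example 4087 + 2113).

```python
def rule_set_metrics(n_segment, response, n_total, base_rate):
    """Gain, response and index for a set of rules.

    n_segment: cases covered by the selected rules
    response:  share of covered cases that belong to the target category
    n_total:   size of the whole sample (assumed here, see lead-in)
    base_rate: share of the target category in the unsegmented sample
    """
    hits_in_segment = n_segment * response
    hits_in_sample = n_total * base_rate
    return {
        "response": response,
        "index": response / base_rate,            # lift over the base rate
        "gain": hits_in_segment / hits_in_sample  # share of all target cases captured
    }

# Six rules for BR=4: 620 authorities, 97.5% response, 25% base rate.
m = rule_set_metrics(n_segment=620, response=0.975, n_total=6200, base_rate=0.25)
print(f"index: {m['index']:.2%}, gain: {m['gain']:.2%}")
```

Under these assumptions the index is about 390% and the gain about 39%, in line with the reported values; the small difference from the reported 389.98% is presumably due to rounding of the response.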

Table 7 also shows the level of confidence
for each of the six rules obtained for BR=1, with
the corresponding gain and index indicators, for
each of which similar goodness analyses could
be made. The charts in Appendix 2 illustrate the
gains, responses and indices for the set of rules
obtained for the classification of the category
BR=4. Note that for the tenth percentile, the
values coincide with those given in
Figure 2.

The Responses Chart indicates the level of
confidence in the rules; thus, for example, for
the rules covering the top 10% of the sample (the
highest decile), the level of confidence is 97.5%.
The higher the chart lies above the 25% benchmark
(the confidence in the prediction for the category
BR=1, using the unsegmented sample), the higher the
model’s predictive and classificatory capacity.


Correct 3911 (63.08%) 


Incorrect 2289 (36.92%) 


With respect to the second of the elements that 
constitute financial condition, the estimation of 
risks correctly classifies 63.08% of the cases, in 
contrast to the level of risk assumed in the un- 
segmented model (75%). Hence, it improves the 
prediction and classification results by reducing 
this risk to 36.92% (See Table 8). 

With respect to the risks for the two categories 
of greatest interest (STS=1 and STS=4), the rate 
of accurate prediction rises considerably, with the 
response (confidence) rate reaching 99.6%, with 
a gain of 39.84%, while the index value ap- 
proaches 400% for the case of local authorities 
with the highest level of short-term solvency. For 
the case of the authorities with the poorest short- 
term solvency, the index of response is lower than 
for those situated in the higher quartile, but in 
general the model produces very satisfactory 
classifications (See Table 9). 

For the element of flexibility, concerning the 
goodness of fit, note that the risk continues to 
decrease and that accurate predictions are made 
for 64.19% of cases, this value being similar to 
that obtained with the two previous models (See 
Table 10). 

On examination of the rules that represent the 
best and the worst results related to the entity’s 




Measuring the Financial Crisis in Local Governments through Data Mining 


Table 9. Gain, index and response with the rules selected for STS=4 and STS=1

[Table body not recoverable from the source. Surviving row labels: "4 Rules Decile STS=4" and "8 Rules Decile STS=1".]


flexibility, we see that, once again, the success 
rate exceeds 91%. Both the gain and the index 
present results similar to those achieved with the 
models measuring sustainability and short-term 
solvency (See Table 11). 

With respect to the goodness of the element 
vulnerability, the following table shows that this 
model produces the greatest reduction in risk, 
with success rates exceeding 81% (See Table 12). 

Finally, regarding the goodness of this element
for the principal profiles in quartiles 1 (upper
decile of rules for TR=1) and 4 (upper decile of
rules for TR=4), we see that the model achieves
a similar level of response to the others, with the
local authorities in the highest decile being
located at 97.7%, while those in the lowest decile
are even higher (99.9%) (see Table 13).

We now show the charts illustrating the goodness
of the modelling of the different elements
that make up the financial condition (see
Appendix 2). The Gain Index Chart is interpreted in
a similar way to the Responses Chart: the higher
the curve, the better the goodness of the model.
For example, the same 97.5% confidence in the rules
for the highest decile represents a probability of
accurate prediction 3.9 times higher than the
initial 25% corresponding to the unsegmented
sample. The charts below illustrate the gain, the
response and the index values for the whole set of
rules obtained by the model for the success
category (STS=4). In all three figures, the
elevation of the curve above the initial slope
reflects the substantial improvement in predictive
capacity achieved from applying the rules obtained.
The following charts show the behaviour with
respect to the different percentiles. The three
figures show the elevation of the curve above the
initial slope, reflecting the substantial
improvement in predictive and explanatory capacity
achieved with the use of the rules obtained. It is
only shown for rules obtained for FL=1, the most
relevant category. Chart 4 illustrates the Gain,
Index and Response values for the set of rules for
TR=4. Again, it can be seen that the rules model
obtained makes an important contribution to
prediction for the financial independence of local
authorities.

[Table 10 entries surviving in the source: Correct 3980 (64.19%); Incorrect 2220 (35.81%).]
[Other surviving table entries: Response (%) 86.58; Index (%) 346.34; count 1075.]

In summary, we can conclude that the variables
that most influence the Index of non financial
budget result are the current margin between
budget receipts and expenses, and the levels of
capital expenditure. The variables which have
most influence on a local authority’s capacity
to manage its level of debt are related to the
financial costs per capita that must be borne, the
impact of fixed charges on the entity’s funding
structure and its current receipts. Finally, the
level of financial independence of an entity
depends, in most cases, on the levels of transfers
received from higher levels of the public sector,
and on the weight attained by current receipts in
the sum of the entity’s total receipts.

Table 11. Gain, index and response with the rules selected for FL=4 and FL=1

[Table body not recoverable from the source.]

Table 12. Estimation of risks with the total model for TR

Correct 5035 (81.21%)
Incorrect 1165 (18.79%)


CONCLUSION AND DISCUSSION 


The detection and rectification of financial crises
in local authorities is of fundamental interest for
public-sector managers. Nevertheless, in deciding
whether a local authority has managed well
or badly, it is necessary to take into account a
series of external factors that are influential in
this respect. In general, for all countries, the
proposed model represents an advance in the
maximization of benchmarking, which is an essential
process in public-sector management. A control
system of these characteristics makes it possible
to advise different types of users of the existence
of financial tensions; such users might include
public-sector managers in authorities responsible
for supervising the financial situation of town and
city councils, or senior officers in such councils
who need to know how resources are being managed,
and how this is done in comparable councils.

In order to determine whether a local authority 
is experiencing a financial crisis, we consider the 
concept of financial condition, which is measured 
by means of different elements, including short- 
term solvency (the capacity to generate liquidity 
in the immediate future) and budgetary solvency 
(the capacity to meet budgetary obligations). This 
concept can be divided into other, more specific, 
aspects, such as those of flexibility, sustainability 
and vulnerability (Greenberg and Hillier 1995; 
CICA 1997). Finally, long-term solvency is
measured through the incorporation of a considerable
period of time into the indicators considered.

However, there is a problem concerning the 
measurement of the elements that constitute 
financial condition, namely the non-existence 
of an instrument that can be used to measure the 
different aspects that make it up, bearing in mind 
the large body of variables that can be applied for 
this purpose, as well as the need to take a long- 
term view. We propose a means of overcoming 
this problem, by applying data mining using the 
CHAID algorithm. This methodology enables us 
to create non-binary decision trees, with multiple 
branches for each node, providing occurrence 
probabilities via exclusionary rules, and is espe- 
cially suitable for large sample sizes, for which, 
in principle, no model has yet been established 
for the phenomenon in question. The financial 
condition, in the terms defined in the present 
chapter, provides the characteristics necessary 
for such an application. 
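As a rough illustration of how a CHAID-style tree chooses a multiway split, the sketch below scores each already discretized predictor by its Pearson chi-square statistic against the target and would branch on every category of the winning predictor. It is a simplified sketch under stated assumptions, not the procedure used in this chapter: full CHAID compares Bonferroni-adjusted p-values and merges categories that do not differ significantly, and both steps are omitted here. The records, variable names and category labels are hypothetical.

```python
from collections import Counter

def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    stat = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

def best_split(records, predictors, target):
    """Pick the predictor most strongly associated with the target.

    Simplification: raw statistics are compared, whereas CHAID
    compares adjusted p-values.
    """
    ys = [rec[target] for rec in records]
    scores = {
        p: chi_square([rec[p] for rec in records], ys) for p in predictors
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy sample: BR is the quartile of budgetary sustainability.
records = [
    {"gross_savings": "low", "fixed_charges": "high", "BR": 1},
    {"gross_savings": "low", "fixed_charges": "low", "BR": 1},
    {"gross_savings": "mid", "fixed_charges": "high", "BR": 2},
    {"gross_savings": "mid", "fixed_charges": "low", "BR": 3},
    {"gross_savings": "high", "fixed_charges": "high", "BR": 4},
    {"gross_savings": "high", "fixed_charges": "low", "BR": 4},
]

best, scores = best_split(records, ["gross_savings", "fixed_charges"], "BR")
branches = sorted({rec[best] for rec in records})  # one branch per category
print(best, branches, scores)
```

Because the winning predictor keeps all of its categories, the node has as many branches as categories, which is what distinguishes this family of trees from binary ones.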

The results obtained from applying the above
methodology to evaluating financial independence,
short-term solvency, flexibility and budgetary
sustainability are highly satisfactory. The
models derived, for all the Spanish local
authorities analyzed, produced a success rate of
over 63%, while in the case of financial
independence, over 80% accuracy was achieved.
Clearly, the model developed presents a high degree
of explanatory and predictive capacity.

For the specific cases of the worst and best 
values, i.e. the first and fourth quartiles for each 
of the elements of financial condition analyzed, 
an even higher rate of accuracy was recorded, 
ranging from 86% (for the case of the local authori- 
ties with the worst situation regarding short-term 
solvency) to 99.9% (for those authorities with the 
highest levels of financial independence). The 
results also suggest that the characterization of 
the financial condition by means of four models 
is a good method, as the main rules created by 
means of the different decision trees are made up 
of variables that differ depending on the element 
to be analyzed. Thus, for the levels of budgetary
sustainability, the most significant variables are 
those related to the current margin (gross sav- 
ings), together with the importance of capital 
expenditure in the budgetary structure, while on 
the other hand, the short-term solvency depends 
on the liquid funds possessed by the entity, on 
the time elapsing before payments are made and 
received, and on the fixed charges to be met. The 
flexibility, however, depends on the financial load 
per inhabitant of the municipality, on the total sum 
of fixed charges, and on certain variables related 
to the implementation of current receipts. Finally, 
financial independence depends fundamentally on 
the transfers that the entity receives (an aspect that 
is predictable) and on the fiscal pressure, among 
other elements. 

On the basis of the results reported here, it 
would be useful in the future to include other lines 
of research based on the introduction of variables 
concerning the social and economic context, as 
well as variables related to the way in which 
public services are managed, as these factors 
influence the characterization of local authorities’ 
financial behaviour. Furthermore, we recommend 
the consideration of other algorithms, within the 
data mining method, that could make it possible 
to achieve higher success rates and thus reduce 
the risks involved, by considering all the local 
authorities in question in order to classify and 
predict financial behaviour in local government. 

From the methodological point of view, it
would be appropriate to apply other algorithms
to compare the stability and predictive power of
the model created, in particular the advanced
version C5.0 (Chesney, 2009), which improves
how missing values are dealt with. In addition, we
are aware that the automatic discretization of the
continuous explanatory variables could represent a
pre-processing step with a strong impact, one
that might not be necessary in certain other tree
algorithms. However, since our goal is to measure
the four elements of the financial condition,
such an extension of the study would lead to the
word limits of a book chapter being exceeded.
Therefore we have focused on the application
of the CHAID method to each of the above
four elements, to obtain preliminary results as a
starting point for future research on which we
are currently working, such as the use of Neural
Networks or the Support Vector Machine (SVM).
KEY TERMS AND DEFINITIONS


Data Mining (also called data or knowledge
discovery): The process of analyzing data from
different perspectives and summarizing it into
useful information by finding correlations or patterns
among multiple fields in large relational databases.

CHAID: A decision tree technique used for
classification of a dataset. It provides a set of rules
for application to a new (unclassified) dataset to
predict which records will have a given outcome.
Rule Induction: The extraction of useful
if-then rules from data based on statistical sig- 
nificance. 

Local Government: An administrative body 
that may span one or several geographic areas. It 
may also refer to a city or village. 

Financial Condition: The ability of a gov- 
ernment body to provide services and to meet its 
future obligations. This concept can be measured 
by considering the situation of its net assets, its 
budget balance or the net cash position. 

Flexibility: An entity’s capacity to respond 
to changes in the economy or in its financial cir- 
cumstances, within the limits of its fiscal abilities. 
This capacity is reflected in the degree to which it 
is able to react to such changes, via public debt. 

Sustainability: An organization’s ability to 
maintain, promote and protect the social welfare 
of the population, employing the resources at its 
disposal. 

Independence: An organization’s level of 
dependence on the external funding received via 
transfers and grants. 

Short-Run Solvency: An entity’s ability to 
generate sufficient liquidity to pay its short-term 
debts. 

Long-Run Solvency: A government’s abil- 
ity to respond adequately to meet its long-term 
obligations. 


ENDNOTES 


Each rule is derived from a particular route
defined by the tree, until each terminal node
(t) is reached. Therefore, there are as many
rules as there are terminal nodes in the tree.

F-Test. This test is based on the ratio of the
variance between the groups and the variance
within each group. If the means are the
same for all groups, you would expect the F
ratio to be close to 1 since both are estimates
of the same population variance. The larger
this ratio, the greater the variation between
groups and the greater the chance that a
significant difference exists (see Ipiña, S.
Inferencia estadística y análisis de datos.
Madrid. Pearson. 2008).

The population segmentation, carried out
taking into account the different levels of
the explanatory variables, produces a global
model that is structured as a tree, with a large
number of rules or local authority profiles,
although for the purposes of the present study
only the most important have been selected.

It is not necessary to describe the main rules
obtained for BR=2 and BR=3, as it is the
extreme quartiles, indicative of success and
failure profiles, that are the most interesting
and useful. For the same reason, these
quartiles are also omitted for the other three
models examined in this study.
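The F ratio described in the endnote on the F-test can be computed directly from grouped data. The sketch below is a minimal one-way version in plain Python; the function name `f_ratio` and the three groups of values are hypothetical and serve only to show the arithmetic.

```python
def f_ratio(groups):
    """One-way F ratio: between-group variance over within-group variance."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Sum of squares between groups, weighted by group size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Sum of squares within groups.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical values of a continuous predictor in three child nodes.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [7.0, 8.0, 9.0]]
f = f_ratio(groups)
print(f"F = {f:.2f}")
```

Here the third group lies far from the other two, so the between-group variance dominates and the ratio is well above 1.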
APPENDIX 1 (FIGURE 6, FIGURE 7)


Figure 6. Indicators for analysis of the financial factor in the financial condition (1) 


VARIABLE FORMULATION 

Index of implementation of expenditure Net recognized obligations / Definitive credits 

Index of implementation of current expenditure Current budget net recognized obligations / Current budget 
definitive credits 

Index of implementation of capital expenditure Capital budget net recognized obligations / Capital budget 
definitive credits 

Index of implementation of receipts Net extinguished receivables / Final previsions 

Index of implementation of current receipts Current budget net extinguished receivables / Current budget
final previsions


Index of implementation of capital receipts Capital budget net extinguished receivables / Capital budget 
final previsions 

Index of current budget payments Current budget liquid payments made / Net recognized 
obligations 

Index of current budget receipts Current budget payments received / Net extinguished 
receivables 

Index of public expenditure per capita Net recognized obligations / No. of inhabitants 

Index of current expenditure per capita Net recognized obligations Chaps. I-IV / No. of inhabitants 

Index of significance of current expenditure Net recognized obligations Chaps. I-IV / Net recognized
obligations 

Index of significance of current receipts Current net extinguished receivables / Net extinguished 
receivables 

Index of taxation receipts Extinguished receivables Chaps. I-III / Current net 
extinguished receivables 

Index of gross savings Gross savings / Current net extinguished receivables 

Index of capital expenditure per capita Net extinguished receivables Chaps. VI-VII / No. of inhabitants

Index of significance of capital expenditure Net extinguished receivables Chaps. VI-VII / Net recognized 
obligations 

Index of capital funding Net extinguished receivables Chaps. VI-VII / Recognized 
obligations Chaps. VI-VII 

Index of net savings Net savings / Current net extinguished receivables 

Index of significance of financial load Net recognized obligations Chaps. III and IX / Net recognized
obligations

Index of financial load per capita Net recognized obligations Chaps. III and IX / No. of
inhabitants

Index of accumulated debt per capita Debt balance of the corporation per capita 

Index of indebtedness over current receipts Outstanding debts owed at year end / Current receipts 

Index of the weight of financial load Net recognized obligations Chaps. III and IX / Current net 
recognized obligations 

Index of immediate liquidity Liquid funds / Obligations pending payment 

Index of short-term solvency Liquid funds and obligations pending receipt / Obligations 
pending payment 

Mean receipts lapse (Obligations pending receipt / Net extinguished obligations) x
365

Mean payment lapse (Obligations pending payment / Net recognized obligations) x 
365 

Index of significance of cash surplus General expenses cash surplus / Obligations pending payment 

Index of year-end liquidity Difference between current budget payments received and paid 

Index of current financial independence Current recognized obligations / Recognized receivables 
Chaps. I-I and V 

Index of total financial independence Net recognized obligations / Recognized receivables Chaps. I-III,
and V, VI, VIII and XI

Index of non financial budget result Net recognized obligations Chaps. I-VII / Net recognized
receivables Chaps. I-VII 

Index of fiscal pressure Net recognized receivables Chaps. I-II per capita

Index of current transfer receipts Recognized obligations Chaps. I-III / Recognized obligations
Chaps. I-IV
Figure 7. Indicators for analysis of the financial factor in the financial condition (2)


VARIABLE FORMULATION 

Index of gross savings (Net recognized receivables Chaps. I-IV — Net recognized
obligations Chaps. I, II and IV) / Net recognized receivables
Chaps. I-IV

Index of income results Current budget receivables pending payment / Total 
receivables pending payment 

Index of expenditure results Current budget obligations pending payment / Total obligations 
pending payment 

Margin of current self funding Recognized receivables Chaps. I-IV and IX / Recognized 
receivables Chaps. I-V 

Index of available cash (Current budget receipts — Current budget payments made) / 
Net recognized obligations 

Index of staff costs Recognized obligations Chap. I / Current recognized
obligations

Index of current transfers effected Recognized obligations Chap. IV / Current recognized 
obligations 

Index of staff costs per capita Recognized obligations Chap. I per capita 

Index of expenditure on goods and services, per capita Recognized obligations Chap. II per capita 

Index of financial expenditure per capita Recognized obligations Chap. III per capita 

Index of investment per capita Recognized obligations Chap. VI per capita 

Index of coverage of financial load Margin of current receipts (Income Chaps. I-V — Expenses 
Chaps. I-IV) / Financial payments (Expenses Chaps. III and 
IX) 

Index of fixed charges Recognized obligations Chaps. I-I and IX / Recognized 
receivables Chaps. I-V 

Index of coverage of expenditure Total recognized receivables / Net recognized receivables 

Sum of current receipts Sum of recognized receivables Chaps. I-V 

Index of current receipts per capita Recognized receivables Chaps. I-V / No. of inhabitants 

Budget chapters of expenses and receipts

EXPENSE BUDGET                            INCOME BUDGET

Chapter I: Staff costs                    Chapter I: Direct taxes
Chapter II: Goods and services            Chapter II: Indirect taxes
Chapter III: Financial costs              Chapter III: Fees and public charges
Chapter IV: Current transfers             Chapter IV: Current transfer receipts
Chapter VI: Investment costs              Chapter V: Patrimonial receipts
Chapter VII: Capital transfer costs       Chapter VI: Sales of real investments
Chapter VIII: Financial asset costs       Chapter VII: Capital transfer receipts
Chapter IX: Financial liability costs     Chapter VIII: Receipts from financial assets
                                          Chapter IX: Receipts from financial liabilities




APPENDIX 2 (FIGURE 8) 


Figure 8. Gain, Index and Response with the Rules Obtained for BR=4 (number 1); for STS=4 (number 
2); for FL=1 (number 3); for TR=4 (number 4) 
ABSTRACT


The chapter exposits the strategies employed by the public long-term care systems operated by each U.S.
state government. The central technique employed in this investigation is fuzzy decision trees (FDT),
producing a rule-based classification system using the well-known soft computing methodology of fuzzy
set theory. It is a timely exposition, as the employment of set-theoretic approaches to organizational
configurations, including the fuzzy set representation, is starting to be discussed. The survey
considered asked respondents to assign each state system to one of the three ‘orientations to innovation’
contained within Miles and Snow’s (1978) classic typology of organizational strategies. The instigated
aggregation of the experts’ opinions adheres to the fact that each long-term care system, like all
organizations, is “likely to be part prospector, part defender, and part reactor, reflecting the complexity of
organizational strategy”. The use of FDTs in the considered organization research problem is pertinent
since the linguistic-based fuzzy decision rules constructed open up the ability to understand the
relationship between a state's attributes and its predicted position in a general strategy domain - the
essence of data mining.


DOI: 10.4018/978-1-60566-906-9.ch003


INTRODUCTION

With data storage increasing at a phenomenal rate,
traditional ad hoc mixtures of data mining tools are
no longer adequate. In one response, some attention
has been given to the potential for soft computing
frameworks to provide flexible information
processing capability that can exploit the tolerance of
imprecision, uncertainty, approximate reasoning,
and partial truth in knowledge discovery (Mitra et
al., 2002). This chapter extends that line of enquiry
by providing an early and detailed exposition of
the data mining potential of a soft computing
methodology that is based on fuzzy set theory,
henceforth FST (Zadeh, 1965).

Since its introduction in 1965, FST has been closely
associated with uncertain reasoning and is the
earliest and most widely reported constituent of
soft computing (Mitra et al., 2002). Of particular
interest in this exposition of data concerning
public policy and strategy, FST offers
opportunities to develop techniques that incorporate
vagueness and ambiguity in their operation, and it
allows outputs to be presented in a highly readable
and easily interpretable manner (Zhou and Gan,
2008). While data mining encompasses the typical
tasks of classification, clustering, association
and outlier detection, here its role in rule-based
classification is considered.

Previous FST-based research in organizational
and policy contexts is limited but includes
explaining constitutional control of the executive of
parliamentary democracies in US states (Pennings,
2003), and the evaluation of knowledge management
capability of organizations (Fan et al., 2009).
Ragin and Pennings (2005) give a discussion of
FST in social research, in their introduction to a
special issue of the journal Sociological Methods &
Research. This acknowledges the need to continually
validate this new methodology (FST), through
its continued application. A pertinent study by Fiss
(2007) considered the whole issue of the employment
of a set-theoretic approach to organizational
configurations, including the progression from a
crisp to fuzzy set representation, and the latter’s
potential for undertaking appropriate analysis.

The context of the exposition presented in this 
chapter is a study of the strategies employed by 
the public long-term care systems operated by 
each U.S. state government. The main dataset 
was collected from a survey of experts in this area 
(including academics, government officials, and 
service providers). The survey asked respondents 
to assign each state system to one of the three 
‘orientations to innovation’ contained within 
Miles and Snow’s (1978) classic typology of
organizational strategies: prospectors, defend- 
ers, and reactors. Briefly, these strategic groups 
describe different orientations to strategy from 
the more consistently innovative prospectors, to 
reactors that typically innovate only after coer- 
cion. The instigated aggregation of the experts’ 
opinions adheres to the fact that each long-term 
care system, like all organizations, is “likely to 
be part prospector, part defender, and part reac- 
tor, reflecting the complexity of organizational 
strategy” (Andrews et al., 2006). 

In this chapter, the aggregated expert assignments
are assessed using a fuzzy decision tree
(FDT) analysis of state long-term care system
characteristics. The pertinence of this analysis
is that, with the federal system existing in each
U.S. state, the decision rules constructed are in
respect of the state’s governing organization’s
management attitudes to healthcare. FDT is a data
mining technique which benefits from the general
methodology FST (Yuan and Shaw, 1995; Mitra
et al., 2002). The overriding remit of decision
trees, within crisp and fuzzy environments, is
the classification of objects described by a
data set in the form of a number of condition and
decision attributes.

A decision tree, in general, starts with an identified
root node, and paths are constructed down to
leaf nodes, where the attributes associated with
the intermediate nodes are identified through a
measure that gauges the classification
certainty of the objects down that path. Each
path down to a leaf node forms an ‘if.. then..’
decision rule, used to classify those objects whose
condition attribute values satisfy the condition
part of that rule. Beyond FDT, other rule-based
classification methods include, amongst others,
RIPPER (Cohen, 1995; Thabtah et al., 2006) and
rough set theory (Beynon et al., 2000).

The development of decision trees in a fuzzy
environment furthered the readability of the now
constructed ‘if.. then..’ fuzzy decision rules (Zhou
and Gan, 2008). The potential appropriateness of
FDTs in organization research can be gauged from
the recent statement by Li et al. (2006, p. 655):


“Decision trees based on fuzzy set theory combines 
the advantages of good comprehensibility of deci- 
sion trees and the ability of fuzzy representation 
to deal with inexact and uncertain information.” 


Their findings also highlight that FDTs may 
not need extensive data to operate or intensive 
computing powers, pertinent to the analysis un- 
dertaken in this chapter. 

The specific FDT technique employed here 
was presented in Yuan and Shaw (1995) and 
Wang et al. (2000). It attempts to include the 
cognitive uncertainties evident in the data values 
(condition and decision attribute values), in the 
creation of concomitant sets of fuzzy ‘if.. then..’
decision rules, whose condition and decision parts,
using concomitant attributes, can be described in
linguistic terms (such as low, medium or high).
This FDT technique has been used in Beynon et 
al. (2004a) to investigate the audit fee levels of 
companies, and Beynon et al. (2004b) to investi- 
gate the songflight of the Sedge Warbler. 

The use of FDTs in the considered organization 
research problem is pertinent since the linguistic 
based fuzzy decision rules constructed open up 
the ability to understand the relationship between 
a state’s attributes and their predicted position in 
a general strategy domain - the essence of data 
mining. In general, linguistic variables are often
used to denote words or sentences of a natural
language (Zadeh, 1975a, 1975b, 1975c). Their
utilisation is appropriate for data mining where
information may be qualitative, or quantitative
information may not be stated precisely (Wang
and Chuu, 2004; Fan et al., 2009), as is often the
case in organizational and policy research.

The contribution of this book chapter is a
clear understanding of the advantages of the
utilization of FDTs in data mining in organizational
and policy research, including: the formulation
of attribute membership functions (MFs), fuzzy
decision tree construction, and inference of the
produced fuzzy decision rules. A small hypothetical
example is also included to enable the reader
to comprehend the analytical rudiments of the
technique employed, and the larger previously
described application demonstrates the potential
interpretability allowed through the use of this
data mining approach.


BACKGROUND 


The background of this chapter covers: the
rudiments of fuzzy set theory, the fuzzy decision
tree (FDT) approach considered, and a tutorial
presentation on the application of FDTs on a small
example data set.


Fuzzy Set Theory 


In fuzzy set theory (Zadeh, 1965), a grade of
membership exists to characterise the association
of a value x to a set S. The concomitant membership
function (MF), defined $\mu_S(x)$, has range [0, 1]. The
domain of a numerical attribute can be described
by a finite series of MFs that each offers a grade
of membership to describe a value x, which form
its concomitant fuzzy number (Kecman, 2001).
Further, the finite set of MFs defining a numerical
attribute’s domain can be denoted a linguistic
variable (Herrera et al., 2000). Zadeh (1975a-c)
offers an early insight on the concept of a linguistic
variable, where each MF, within a set of MFs,
denotes a linguistic term. Different types of MFs
have been proposed to describe fuzzy numbers,
including triangular and trapezoidal functions.
Yu and Li (2001) highlight that MFs may be,
advantageously, constructed from mixed shapes,
supporting the use of piecewise linear MFs (see
also Dombi and Gera, 2005). The functional form
of a piecewise linear MF (in the context of the
$j$th linguistic term $T_j^k$ of a linguistic variable
$A_k$) is given through a visual representation in
Figure 1, which elucidates their general structure.

The general form of a MF presented in Figure
1 shows how the value of a MF is constrained
within 0 and 1. In Figure 1, a piecewise triangular
MF is shown, based on five defining values
$[\alpha_{j,1}, \alpha_{j,2}, \alpha_{j,3}, \alpha_{j,4}, \alpha_{j,5}]$. The implication of these
specific defining values is also illustrated, including
the idea of associated support, $[\alpha_{j,1}, \alpha_{j,5}]$ in
Figure 1. Further, the notion of dominant support
can also be considered, where a MF is most
closely associated with an attribute value, the
domain $[\alpha_{j,2}, \alpha_{j,4}]$ in Figure 1.
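As an illustration of how five defining values fix a piecewise linear MF, the following Python sketch returns a grade of 0 at the two outer defining values, 0.5 at the second and fourth, and 1 at the middle one, interpolating linearly in between. The function name `piecewise_mf`, the treatment of infinite defining values as flat shoulders, and the numbers in `EXAMPLE` are assumptions made for illustration only.

```python
import math


def piecewise_mf(x, a):
    """Grade of membership of x for defining values a = [a1, a2, a3, a4, a5].

    Assumed shape: grade 0 at a1 and a5, 0.5 at a2 and a4, 1 at a3,
    with straight lines in between. An infinite a2 (or a4) is read as a
    flat shoulder with grade 1 on that side.
    """
    a1, a2, a3, a4, a5 = a
    if x == a3:
        return 1.0
    if x < a3:
        if math.isinf(a2):  # left shoulder: no rising edge
            return 1.0
        if x <= a1:
            return 0.0
        if x <= a2:
            return 0.5 * (x - a1) / (a2 - a1)
        return 0.5 + 0.5 * (x - a2) / (a3 - a2)
    # From here on x > a3.
    if math.isinf(a4):  # right shoulder: no falling edge
        return 1.0
    if x >= a5:
        return 0.0
    if x >= a4:
        return 0.5 - 0.5 * (x - a4) / (a5 - a4)
    return 1.0 - 0.5 * (x - a3) / (a4 - a3)


# Illustrative defining values (hypothetical).
EXAMPLE = [14, 18, 22, 30, 40]
for value in (14, 16, 18, 22, 26, 30, 40):
    print(value, piecewise_mf(value, EXAMPLE))
```

Values inside the two middle defining values receive a grade of at least 0.5, which is the dominant support described above; values outside the outer pair receive a grade of 0.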

These definitions of support and dominant support,
along with the defining values, are closely
associated with the commonly used concept of the
α-cut, in particular the defining values are when
α equals 0, 0.5 and 1 (Kovalerchuk and Vityaev,
2000). Moreover, the issue becomes the assignment
of values to the defining values, to enable
the creation of the MFs required to describe a
numerical attribute. As Fiss (2007) suggests, talking
about FST in an organization research context,
FST is a superior way (over crisp set theory) of
offering substantive knowledge on a numerical
attribute, with meaningful values required for the
defining values.

Beyond this technical exposition of the rudi- 
ments of FST, and general positive elucidation 
of this methodology presented in the introduc- 
tion, Ragin and Pennings (2005, p. 425) present 
four claims on the applicability of FST to social 
research: 


1. FST permits a more nuanced representa- 
tion of categorical concepts by permitting 
degrees of membership in sets rather than 
binary in-or-out membership. 

2. FST can be used to address both diversity and
ambiguity in a systematic manner, through 
set calibration and set-theoretic relations. 

3. More verbal theory in the social sciences 
is formulated explicitly in set-theoretic 
terms. The FST approach provides a faithful
translation of such theory. 

4. FST enables researchers to evaluate set- 
theoretic relationships such as intersection 
and inclusion and, thereby, necessity and 
sufficiency. Set theoretic relationships are 
very difficult to evaluate using conventional 
approaches such as the general linear model. 


Aspects of these claims will become apparent
in the technical description of the FDT technique
employed (given next), and in its employment in
a small example and the larger state strategy
problem given later.


Fuzzy Decision Trees 


Following the realization of the decision tree
approach to data mining in the 1960s (Hunt et al.,
1966), the introduction of fuzzy decision trees
(FDTs) was first loosely referenced in the late
1970s (Chang and Pavlidis, 1977), with early
formulations of FDTs including derivatives of
the well known ID3 approach (Quinlan, 1979)
utilizing fuzzy entropy (Ichihashi et al., 1996),
and other versions of crisp FDT techniques (see
Pal and Chakraborty, 2001; Olaru and Wehenkel,
2003).

This section outlines the technical details of
the FDT approach introduced in Yuan and Shaw
(1995). With an inductive fuzzy decision tree,
the underlying knowledge related to a decision
outcome can be represented as a set of fuzzy ‘if..
then..’ decision rules, each of the form:


If ($A_1$ is $T^1_{i_1}$) and ($A_2$ is $T^2_{i_2}$) ... and ($A_k$ is $T^k_{i_k}$)
then $D$ is $D_j$,

where $A_1$, $A_2$, .., $A_k$ and $D$ are linguistic variables
for the multiple antecedents ($A_i$’s) and consequent
($D$) statements used to describe the considered
objects, and $T(A_k) = \{T_1^k, T_2^k, .., T_{S_k}^k\}$ and
$\{D_1, D_2, ..., D_L\}$ are their respective linguistic
terms. Each linguistic term $T_j^k$ is defined by the
MF $\mu_{T_j^k}(x)$, which transforms a value in its
associated domain to a grade of membership value
between 0 and 1. The MFs, $\mu_{T_j^k}(x)$ and $\mu_{D_j}(y)$,
represent the grade of membership of an object’s
antecedent $A_k$ being $T_j^k$ and consequent $D$ being
$D_j$, respectively.


Figure 1. General definition of piecewise linear MF (including the required defining values $[\alpha_{j,1}, \alpha_{j,2}, \alpha_{j,3}, \alpha_{j,4}, \alpha_{j,5}]$)


A MF $\mu(x)$ from the set describing a fuzzy
linguistic variable $Y$ defined on $X$ can be viewed
as a possibility distribution of $Y$ on $X$, that is
$\pi(x) = \mu(x)$, for all $x \in X$ the values taken by the
objects in $U$ (also normalized so $\max_{x \in X} \pi(x) = 1$).
The possibility measure $E_\alpha(Y)$ of ambiguity is
defined by

$E_\alpha(Y) = g(\pi) = \sum_{i=1}^{n} (\pi_i^* - \pi_{i+1}^*) \ln[i]$,

where $\pi^* = \{\pi_1^*, \pi_2^*, ..., \pi_n^*\}$ is the permutation of the
normalized possibility distribution $\pi = \{\pi(x_1),
\pi(x_2), ..., \pi(x_n)\}$, sorted so that $\pi_i^* \geq \pi_{i+1}^*$ for
$i = 1, ..., n$, and $\pi_{n+1}^* = 0$. In the limit, if $\pi_2^* = 0$,
then $E_\alpha(Y) = 0$, which indicates no ambiguity, whereas
if $\pi_n^* = 1$, then $E_\alpha(Y) = \ln[n]$, which indicates all
values are fully possible for $Y$, representing the
greatest ambiguity.
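
The ambiguity measure just defined can be sketched in a few lines of Python. This is a minimal illustration, assuming the possibility distribution is supplied as a plain list of non-negative numbers; the function name `ambiguity` is hypothetical. The sketch normalizes by the largest possibility, sorts in descending order, appends the terminating zero, and sums the weighted differences.

```python
import math


def ambiguity(pi):
    """Possibility measure of ambiguity g(pi) for a list of possibilities."""
    top = max(pi)
    if top <= 0:
        raise ValueError("at least one possibility must be positive")
    # Normalize so that the largest possibility equals 1, then sort descending.
    ordered = sorted((p / top for p in pi), reverse=True)
    ordered.append(0.0)  # the extra term after the last sorted value is 0
    # Position i (1-based) carries the weight ln(i).
    return sum((ordered[i] - ordered[i + 1]) * math.log(i + 1)
               for i in range(len(pi)))


print(ambiguity([1.0, 0.0, 0.0]))  # a single possible value: no ambiguity
print(ambiguity([1.0, 1.0, 1.0]))  # all values fully possible: ln(3)
```

The two printed cases correspond to the limiting situations described above: no ambiguity when only one value is possible, and the greatest ambiguity, ln[n], when all n values are fully possible.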
The ambiguity of attribute $A$ (over the objects
$u_1, ..., u_m$) is given as:

$E_\alpha(A) = \frac{1}{m} \sum_{i=1}^{m} E_\alpha(A(u_i))$,

where $E_\alpha(A(u_i)) = g(\mu_{T_s}(u_i) / \max_{1 \leq j \leq s}(\mu_{T_j}(u_i)))$, with
$T_1, ..., T_s$ the linguistic terms of an attribute
(antecedent) with $m$ objects. When there is overlapping
between linguistic terms (MFs) of an attribute
or between consequents, then ambiguity exists.
For all $u \in U$, the intersection $A \cap B$ of two
fuzzy sets is given by $\mu_{A \cap B}(u) = \min[\mu_A(u), \mu_B(u)]$.
The fuzzy subsethood $S(A, B)$ measures the degree
to which $A$ is a subset of $B$, and is given by

$S(A, B) = \sum_{u \in U} \min(\mu_A(u), \mu_B(u)) / \sum_{u \in U} \mu_A(u)$.

Given fuzzy evidence $E$, the possibility of classifying
an object to the consequent $D_i$ can be defined as

$\pi(D_i | E) = S(E, D_i) / \max_j S(E, D_j)$,

where the fuzzy subsethood $S(E, D_i)$ represents the
degree of truth for the classification rule (‘if $E$
then $D_i$’). With a single piece of evidence (a fuzzy
number for an attribute), the classification ambiguity
based on this fuzzy evidence is defined as
$G(E) = g(\pi(D | E))$, which is measured using the
possibility distribution $\pi(D | E) = (\pi(D_1 | E), ...,
\pi(D_L | E))$.
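
A compact Python sketch of these three quantities (subsethood, possibility of classification, and classification ambiguity) is given below. The function names are hypothetical, and treating an empty evidence set as having subsethood 0 is an assumption. The membership vectors and the two further subsethood values are those quoted for the term T1 = Low in the worked example later in this chapter.

```python
import math


def subsethood(mu_a, mu_b):
    """S(A, B): degree to which fuzzy set A is a subset of fuzzy set B."""
    total = sum(mu_a)
    if total == 0:
        return 0.0  # assumption: empty evidence supports no rule
    return sum(min(a, b) for a, b in zip(mu_a, mu_b)) / total


def possibility(subsethoods):
    """pi(D_i | E): subsethood values scaled by the largest of them."""
    top = max(subsethoods)
    if top == 0:
        raise ValueError("no consequent has positive subsethood")
    return [s / top for s in subsethoods]


def ambiguity(pi):
    """g(pi) for an already normalized possibility distribution."""
    ordered = sorted(pi, reverse=True) + [0.0]
    return sum((ordered[i] - ordered[i + 1]) * math.log(i + 1)
               for i in range(len(pi)))


# Membership grades of the five example objects, as quoted in the text.
mu_t1_low = [0.000, 0.000, 0.000, 0.625, 0.000]
mu_d_low = [1.000, 1.000, 0.750, 0.250, 0.250]

s_low = subsethood(mu_t1_low, mu_d_low)   # 0.250 / 0.625
pi = possibility([s_low, 1.000, 0.000])   # other two S values as reported
g_t1_low = ambiguity(pi)
print(round(s_low, 3), round(g_t1_low, 3))
```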

The classification ambiguity with fuzzy partitioning
$P = \{E_1, ..., E_k\}$ on the fuzzy evidence
$F$, denoted as $G(P | F)$, is the weighted average of
classification ambiguity with each subset of partition:

$G(P | F) = \sum_{i=1}^{k} w(E_i | F) G(E_i \cap F)$,

where $G(E_i \cap F)$ is the classification ambiguity with
fuzzy evidence $E_i \cap F$, and where $w(E_i | F)$ is the
weight which represents the relative size of subset
$E_i \cap F$ in $F$:

$w(E_i | F) = \sum_{u \in U} \min(\mu_{E_i}(u), \mu_F(u)) / \sum_{j=1}^{k} \left( \sum_{u \in U} \min(\mu_{E_j}(u), \mu_F(u)) \right)$.


Table 1. Example small data set



In summary, attributes are assigned to nodes
based on the lowest level of classification
ambiguity. A node becomes a leaf node if the level
of subsethood is higher than some truth value $\beta$
assigned to the whole of the FDT. The classification
from the leaf node is to the decision group
with the largest subsethood value. The truth level
threshold $\beta$ controls the growth of the tree; lower
$\beta$ may lead to a smaller tree (with lower classification
accuracy), higher $\beta$ may lead to a larger tree
(with higher classification accuracy).


Fuzzy Decision Tree Analyses 
of Example Data Set 


In this section a FDT analysis is described on a 
small example data set, consisting of five objects 
described by three conditions (T1, T2 and T3) and 
one decision (D) attribute, see Table 1. 

Using the data set presented in Table 1, the
example FDT analysis starts with the fuzzification
of the individual attribute values. Throughout this
analysis, and pertinent to the later applied FDT
analysis, the level of fuzzification employed is
three MFs (linguistic terms) designated to represent
the linguistic variable for each of the attributes,
condition (T1, T2 and T3) and decision (D), see
Figure 2 for the case of the decision attribute D.
In Figure 2, three MFs, $\mu_L(D)$ (labelled $D_L$ -
Low), $\mu_M(D)$ ($D_M$ - Medium) and $\mu_H(D)$ ($D_H$ - High),
are shown to cover the domain of the decision
attribute D. The concomitant defining values are,
for $D_L$: $[-\infty, -\infty, 14, 18, 22]$, $D_M$: $[14, 18, 22, 30,
40]$ and $D_H$: $[22, 30, 40, \infty, \infty]$. An interpretation
of these MFs, as mentioned, could then simply
be the associated linguistic terms of the three MFs
being low (L), medium (M) and high (H).

For the three condition attributes, T1, T2 and 
T3, their fuzzification is similarly based on the 
creation of linguistic variables, each described by 
three linguistic terms (three MFs), see Figure 3. 

The sets of MFs described in Figure 3 are each
found from a series of defining values, in this case
for T1: $[[-\infty, -\infty, 18, 26, 28], [18, 26, 28, 38,
40], [28, 38, 40, \infty, \infty]]$, T2: $[[-\infty, -\infty, 26, 34,
40], [26, 34, 40, 52, 54], [40, 52, 54, \infty, \infty]]$ and
T3: $[[-\infty, -\infty, 12, 14, 18], [12, 14, 18, 23, 24],
[18, 23, 24, \infty, \infty]]$.

Applying these MFs, in Figures 2 and 3, on 
the example data set in Table 1, achieves a fuzzy 
data set version, see Table 2. 

In Table 2, each condition attribute, T1, T2 
and T3, is described by three values associated 
with the three linguistic terms (L, M and H). Also 
shown, in bold, is the largest of the fuzzy values 
from each triplet of MFs (associated with a single 
fuzzy variable), indicating the most dominant 
linguistic term each condition value is associated 
with (for the individual objects). The same is 
presented for the decision attribute D. 
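
To make the step from a crisp value to a triplet of fuzzy values concrete, the Python sketch below fuzzifies a decision attribute value against the three sets of defining values quoted above for D and picks the dominant linguistic term. The helper names (`grade`, `fuzzify`, `dominant_term`) and the input value 16 are hypothetical, and the interpolation rule (grades 0, 0.5, 1, 0.5, 0 at the five defining values) is an assumption consistent with the defining values being the points where α equals 0, 0.5 and 1.

```python
import math

INF = math.inf
GRADES = [0.0, 0.5, 1.0, 0.5, 0.0]  # assumed grade at each defining value

# Defining values for the decision attribute D, as quoted in the text.
D_TERMS = {
    "L": [-INF, -INF, 14, 18, 22],
    "M": [14, 18, 22, 30, 40],
    "H": [22, 30, 40, INF, INF],
}


def grade(x, a):
    """Linear interpolation through the points (a[i], GRADES[i])."""
    # Infinite defining values mark a shoulder where the grade stays at 1.
    if x <= a[2] and math.isinf(a[1]):
        return 1.0
    if x >= a[2] and math.isinf(a[3]):
        return 1.0
    if x <= a[0] or x >= a[4]:
        return 0.0
    for i in range(4):
        lo, hi = a[i], a[i + 1]
        if lo <= x <= hi:
            if hi == lo:
                return max(GRADES[i], GRADES[i + 1])
            t = (x - lo) / (hi - lo)
            return GRADES[i] + t * (GRADES[i + 1] - GRADES[i])
    return 0.0


def fuzzify(x, terms):
    """Triplet of fuzzy values, one per linguistic term."""
    return {name: grade(x, a) for name, a in terms.items()}


def dominant_term(memberships):
    """Linguistic term with the largest fuzzy value (bold in Table 2)."""
    return max(memberships, key=memberships.get)


fuzzy_d = fuzzify(16, D_TERMS)  # 16 is a hypothetical attribute value
print(fuzzy_d, dominant_term(fuzzy_d))
```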

The fuzzy data set represented in Table 2
is suitable for an FDT-based analysis. For the
construction of a FDT (using the FDT technique
described earlier), the classification ambiguity of
each condition attribute with respect to the decision
attribute is first considered, namely the evaluation
of the G(E) values. A threshold value of $\beta$ = 0.800
is used throughout this construction process,
being the level of subsethood required for a node
to be designated a leaf node (see later).
Figure 2. Fuzzification of decision attribute D using three MFs (labeled $D_L$ - Low, $D_M$ - Medium and $D_H$ - High)

Figure 3. Fuzzification of condition attributes, T1, T2 and T3, using three MFs in each case

Table 2. Fuzzy data set using three MFs for each condition and decision attribute


T1 = [T1_L, T1_M, T1_H]   T2 = [T2_L, T2_M, T2_H]   T3 = [T3_L, T3_M, T3_H]   D = [D_L, D_M, D_H]


The evaluation of a G(E) value is shown for
the attribute T1 ($= g(\pi(D | T1))$), where it is broken
down to the fuzzy labels L, M and H, so for L:
$\pi(D_i | T1_L) = S(T1_L, D_i) / \max_j S(T1_L, D_j)$,
considering $D_L$, $D_M$ and $D_H$ with the information in
Table 2:

$S(T1_L, D_L) = \sum_{u \in U} \min(\mu_{T1_L}(u), \mu_{D_L}(u)) / \sum_{u \in U} \mu_{T1_L}(u)$

= (min(0.000, 1.000) + min(0.000, 1.000) + min(0.000, 0.750)
+ min(0.625, 0.250) + min(0.000, 0.250)) / (0.000 + 0.000 + 0.000 + 0.625 + 0.000)

= (0.000 + 0.000 + 0.000 + 0.250 + 0.000)/0.625

= 0.250/0.625 = 0.400,


whereas $S(T1_L, D_M)$ = 1.000 and $S(T1_L, D_H)$ =
0.000. Hence $\pi$ = {0.400, 1.000, 0.000}, giving
$\pi^*$ = {1.000, 0.400, 0.000}, with $\pi^*_4$ = 0, then:


$G(T1_L) = g(\pi(D | T1_L)) = \sum_{i=1}^{3} (\pi^*_i - \pi^*_{i+1}) \ln[i]$

= (1.000 − 0.400) ln[1] + (0.400 − 0.000) ln[2] + (0.000 − 0.000) ln[3]
= 0.277,


with $G(T1_M)$ = 0.430 and $G(T1_H)$ = 0.207, then
G(T1) = (0.277 + 0.430 + 0.207)/3 = 0.3048.
Compared with G(T2) = 0.4305 and G(T3) =
0.3047, it follows the T3 attribute, with the least
classification ambiguity, forms the root node in
this case. The subsethood values in this case are,
for T3: $S(T3_L, D_L)$ = 0.273, $S(T3_L, D_M)$ = 0.073 and
$S(T3_L, D_H)$ = 0.655; $S(T3_M, D_L)$ = 0.965, $S(T3_M,
D_M)$ = 0.175 and $S(T3_M, D_H)$ = 0.000; $S(T3_H, D_L)$ =
0.659, $S(T3_H, D_M)$ = 0.431 and $S(T3_H, D_H)$ = 0.000.
In each case the linguistic term with largest
subsethood value (shown in bold) indicates the
possible augmentation of the path. For $T3_L$, its
largest subsethood value is 0.655 ($S(T3_L, D_H)$),
below the desired truth value of 0.800, hence it
requires further consideration of its augmentation.
For $T3_M$, its largest subsethood value is 0.965
($S(T3_M, D_L)$), above the desired truth value of
0.800, and so it is a leaf node, with classification to
$D_L$. For $T3_H$, its largest subsethood value is 0.659
($S(T3_H, D_L)$), hence it is also not able to be a leaf
node and further possible augmentation needs to
be considered.
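
The leaf node test applied to the three branches of T3 can be written out as a short Python sketch. The subsethood values and the 0.800 threshold are those quoted in the text; the assignment of each value to the Low, Medium and High decision terms follows the order in which the values are listed, which is an assumption, and the helper name `assess_branch` is hypothetical.

```python
BETA = 0.800  # truth level threshold used in the example

# Subsethood of each T3 branch to the decision terms L, M and H.
T3_SUBSETHOOD = {
    "T3 = L": {"L": 0.273, "M": 0.073, "H": 0.655},
    "T3 = M": {"L": 0.965, "M": 0.175, "H": 0.000},
    "T3 = H": {"L": 0.659, "M": 0.431, "H": 0.000},
}


def assess_branch(subsethoods, beta):
    """Return (best decision term, its truth level, whether it is a leaf)."""
    term = max(subsethoods, key=subsethoods.get)
    truth = subsethoods[term]
    return term, truth, truth > beta


results = {branch: assess_branch(values, BETA)
           for branch, values in T3_SUBSETHOOD.items()}
for branch, (term, truth, is_leaf) in results.items():
    status = "leaf node" if is_leaf else "consider augmentation"
    print(f"{branch}: D = {term} ({truth:.3f}) -> {status}")
```

Only the T3 = M branch passes the threshold, so the other two branches are the ones examined for augmentation with a further condition attribute.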

With only three condition attributes considered,
the possible augmentations of $T3_L$ and $T3_H$ are
with either T1 or T2. In the case of $T3_L$, where
$G(T3_L)$ = 0.334, the ambiguity with partition
evaluated for T1 ($G(T3_L$ and T1 | D)) or T2
($G(T3_L$ and T2 | D)) has to be less than this value.
In the case of T1:

$G(T3_L \text{ and } T1 | D) = \sum_{i=1}^{k} w(T1_i | T3_L) G(T3_L \cap T1_i)$.


Starting with the weight values, in the case of
$T3_L$ and $T1_L$, it follows:


$w(T1_L | T3_L) = \sum_{u \in U} \min(\mu_{T1_L}(u), \mu_{T3_L}(u)) / \sum_{j=1}^{k} \left( \sum_{u \in U} \min(\mu_{T1_j}(u), \mu_{T3_L}(u)) \right)$

= (min(0.000, 0.375) + min(0.000, 0.000) + min(0.000, 0.000)
+ min(0.625, 0.000) + min(0.000, 1.000)) / $\sum_{j=1}^{k} \left( \sum_{u \in U} \min(\mu_{T1_j}(u), \mu_{T3_L}(u)) \right)$,


where $\sum_{j=1}^{k} \left( \sum_{u \in U} \min(\mu_{T1_j}(u), \mu_{T3_L}(u)) \right)$ = 1.525, so
$w(T1_L | T3_L)$ = 0.000/1.525 = 0.000. Similarly
$w(T1_M | T3_L)$ = 0.869 and $w(T1_H | T3_L)$ = 0.131,
hence:


$G(T3_L$ and T1 | D) = 0.000 × $G(T3_L \cap T1_L)$ +
0.869 × $G(T3_L \cap T1_M)$ + 0.131 × $G(T3_L \cap T1_H)$

= 0.000 × 1.099 + 0.869 × 0.334 + 0.131 × 0.366

= 0.338,


similarly, $G(T3_L$ and T2 | D) = 0.462. With $G(T3_L$
and T1 | D) = 0.338 the lowest of these two values,
but not lower than the concomitant $G(T3_L)$ = 0.334
value, there is no lessening of ambiguity with the
augmentation of either T1 or T2 to the path $T3_L$.
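
The arithmetic of this augmentation check can be reproduced with a small Python sketch. The membership grades, the 1.525 denominator, the remaining two weights and the three branch ambiguities are taken as reported in the text; the function names `overlap` and `partition_ambiguity` are hypothetical.

```python
def overlap(mu_e, mu_f):
    """Sum over objects of min(mu_E(u), mu_F(u)): the size of E intersect F."""
    return sum(min(e, f) for e, f in zip(mu_e, mu_f))


def partition_ambiguity(weights, ambiguities):
    """Weighted average of the classification ambiguity of each subset."""
    return sum(w * g for w, g in zip(weights, ambiguities))


# Grades of the five objects for T1 = Low and for the T3 branch considered.
mu_t1_low = [0.000, 0.000, 0.000, 0.625, 0.000]
mu_t3_branch = [0.375, 0.000, 0.000, 0.000, 1.000]

denominator = 1.525  # summed overlap across all three T1 terms, as reported
w_low = overlap(mu_t1_low, mu_t3_branch) / denominator

weights = [w_low, 0.869, 0.131]       # weights for T1 = Low, Medium, High
ambiguities = [1.099, 0.334, 0.366]   # ambiguity of each intersected branch
g_with_t1 = partition_ambiguity(weights, ambiguities)

g_branch_alone = 0.334
print(round(g_with_t1, 3), g_with_t1 < g_branch_alone)
```

Because the augmented value is not below the ambiguity of the branch on its own, the sketch arrives at the same decision as the text: the path is left as it is.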

In the case of $T3_H$, there is $G(T3_H)$ = 0.454, and
$G(T3_H$ and T1 | D) = 0.274 and $G(T3_H$ and T2 | D)
= 0.462, with the lowest of these, $G(T3_H$ and
T1 | D) = 0.274, less than $G(T3_H)$ = 0.454, so less
ambiguity would be found if the T1 attribute was
augmented to the T3 = H path. The subsequent
subsethood values in this case for each new path
are, $T1_L$: $S(T3_H \cap T1_L, D_L)$ = 0.400, $S(T3_H \cap T1_L,
D_M)$ = 1.000 and $S(T3_H \cap T1_L, D_H)$ = 0.000; $T1_M$:
$S(T3_H \cap T1_M, D_L)$ = 0.894, $S(T3_H \cap T1_M, D_M)$ =
0.319 and $S(T3_H \cap T1_M, D_H)$ = 0.000; $T1_H$: $S(T3_H
\cap T1_H, D_L)$ = 1.000, $S(T3_H \cap T1_H, D_M)$ =
0.500 and $S(T3_H \cap T1_H, D_H)$ = 0.000.

Figure 4. FDT for example data set with three MFs describing each condition attribute


These subsethood results show all three paths 
end in leaf nodes (largest subsethood value above 
0.800 in each case - shown in bold), hence there 
is no further FDT construction required. The re- 
sultant FDT in this case is presented in Figure 4. 

The tree structure in Figure 4 clearly demon- 
strates the visual form of the results described 
previously. Only shown in each node box is the 
truth level associated with the highest subsethood 
value to a decision attribute linguistic term. There 
are two levels of the tree showing the use of only 
two of the three considered condition attributes, 
T1 and T3. There are five leaf nodes which each
have a defined fuzzy decision rule associated with
them.

One of the rules, R1*, is indicated with a *,
since the largest truth value associated with this
rule is less than the 0.8 truth threshold value
imposed. In the case of this path of the FDT, there was
no further ability to augment it with other nodes
(condition attributes) to improve the subsequent
classification ambiguity. By their very nature, the
fuzzy decision rules are readable (interpretable),
for example, the rule R4 can be written as:


R4: “If T3 = H and T1 = M then D = L (0.894)”. 


In a more readable (interpretable) form this 
rule can be further written as: 


R4: “If T3 is high and T1 is medium then D is 
low with truth level 0.894”. 


Inspection of the objects in Table 2, in terms 
of their fuzzy values, shows the object u, satis- 
fies the condition part of the rule R4, and with its 
known dominant association to D = L, this rule 
correctly classifies this object. 


FUZZY DECISION TREE ANALYSIS 
OF PUBLIC SERVICES IN THE USA 


The starting point for the analysis presented in 
this chapter is that while it is known that variation 
exists among the strategies adopted by US states’ 
long term care (LTC) systems, little analysis has 
been conducted to extend knowledge of this phe- 
nomenon. The conception of strategy employed 
here follows Miles and Snow’s (1978) formulation 
of strategic stance as resting on organizational 
orientation towards innovation that is considered 
unlikely to change substantially in the short term 
(Zajac and Shortell, 1989). The first task of this 
study is to conceptualize what strategy ‘means’ 
in the context of state LTC systems. We begin by 
introducing the distinctive context of this study 
and then argue the relevance of Miles and Snow’s 
classic conception of strategic groups. 
Table 3. State Medicaid LTC Systems’ Strategic Stances


Strategic Stance | Response Type | Characteristics
Prospector | Innovate | Proactive research & experimentation, innovative budgeting, attempts to colonize ‘policy space’ & budgets of other agencies
Defender | Consolidate | Little research & experimentation, later adoption of innovations, defend existing budget and protect existing service/policy portfolio
Reactor | Wait | Seldom makes adjustment of any sort until forced to do so by environmental pressures e.g., consumer advocacy, regulation or litigation


Source: Developed from Boyne and Walker (2004, p. 244)


In contrast to the U.S. hospital sector, state
LTC administrations deliver few services directly
but are the primary strategic bodies with
responsibility for spending Medicaid budgets devolved
from the federal government (Crisp et al., 2003).
State LTC administrations are, therefore, among
the most public organizations in the U.S. because
they are owned by state governments, funded
by government, and subject to high degrees of
political influence (Bozeman, 1987). This set of
contextual features might suggest limited strategic
‘space’ for state LTC administrations. However,
considerable strategic choice is demonstrated
by studies of inter-state variations in state LTC
systems which highlight significant differences
in policies such as provider regulation and cost
control.

From the early 1980s, after decades of consumer
advocacy for more Medicaid resources to
be spent on home and community-based services
(HCBS, e.g., home healthcare) as an alternative to
institutional care provided in nursing homes, some
states began to introduce innovative ‘rebalancing’
policies such as moratoria on the building of new
nursing homes and the development of new HCBS
programs such as personal care. Considerable
inter-state variation persists in rebalancing efforts
such that in 2004, Oregon spent 70 percent of
its Medicaid LTC budget on home care services
while Mississippi spent only 5 percent (Kitchener
et al., 2004).

This brief outline of the field of Medicaid LTC
systems demonstrates that analysts recognize
variation among states’ strategies. This ‘ground-up’
conception of strategy resonates well with Miles
and Snow’s (1978) formulation of strategic stance
as resting on organizational orientation towards
innovation which is considered unlikely to change
substantially in the short term (Zajac and Shortell,
1989; Mouritzen, 1992). This contention is
supported by early public service applications of
Miles and Snow’s work (Nutt and Backoff, 1995;
Walker and Ruekert, 1987), and much general
management research.

To operate their innovation-based conception 
of strategy, Miles and Snow (1978) introduce a 
typology of strategic stances that include: (1) 
prospectors that innovate early and consistently, 
(2) defenders that tend more towards stability, 
and (3) reactors that innovate little and typically 
only when coerced to do so (see Table 3). Here, 
we omit Miles and Snow’s original ‘analyzer’ 
group on the basis of previous assessments that 
it represents an intermediate category (Zahra and 
Pearce, 1990; Boyne and Walker, 2004). While 
we argue that the resulting (reduced) framing of 
the extent to which a state LTC system is an in- 
novative prospector, consolidating defender, or 
passive reactor maintains the taxonomic criterion 
of conceptual exhaustiveness, this is essentially 
a matter for empirical testing. 

Previous applications of Miles and Snow’s
strategy framework in public service settings have
typically used one of three approaches to assign
organizations to strategic stances based on
assessments of issues such as ‘whether strategy
anticipates events or reacts to them’ and ‘orientation
towards change/status quo’ (Wechsler and Backoff,
1986). The first involves experts assigning
all organizations in the field to strategic stances
based on perception and experience. The second
approach involves the assigning of units to
stances based on a variety of statistical techniques
that compare units’ characteristics taken from
archival sources (Shortell and Zajac, 1990). A
third approach involves asking organizational
participants (typically senior managers) to assess
the strategic stance of their own organization.
While simpler approaches result in the assignment
of organizations to a single strategic
stance, this misses the fact that, in reality, “they
are likely to be part prospector, part defender, and
part reactor, reflecting the complexity of
organizational strategy” (Andrews et al., 2006).

Each of the three approaches to strategy 
measurement outlined above rests on the basic 
assumption that when compared with other 
organizations of its type, prospectors would be 
more proactive in terms including: innovations 
in budget use and service mix (Bourgeois, 1980); 
being ‘first movers’ to new circumstances (perhaps 
indicated by innovation awards); and attempting 
to invade the policy and/or budget ‘space’ of other 
agencies (Downs, 1967). Defenders, whether in
the absence of strategy or by conscious choice, would
tend to maintain existing budget distributions and
services, wait until innovations had been evalu- 
ated, and protect their own boundaries rather than 
seeking to colonize other agencies. Reactors would 
typically: alter existing distribution and services 
patterns only under duress, adopt innovations last, 
and be inward looking. 

Our perceptual measure of state LTC strategic
stance was derived from an email survey of
a purposive sample of experts with nation-wide
knowledge of the field of state LTC systems. In
June 2007, participants were asked to assign each
state to one of the strategy groups using a basic
instrument that provided a brief description of the
three stances, listed the states, and asked respondents
to assign each state to a stance category.
We began by assuming that the perceptual ratings
were ordinal and that disagreement among raters
by one category was less serious than across two
categories (e.g., that disagreement is greater if two
respondents rated a state Prospector and Reactor,
rather than if they rated a state as Prospector
and Defender). A final group of 13 raters was
established using two criteria: (1) respondents
who rated all states, and (2) those with an average
agreement rate of 0.5.

Using the judgements made by the 13 experts,
sets of association values can be evaluated for each
state towards the three strategic stances. These
association values are evaluated as the number
of experts that categorised a state to each of the
stances, Prospector, Defender or Reactor, divided
by the number of experts who made a judgement
on that state. For example, when considering
the 13 experts, for the state KY (Kentucky), the
respective breakdown of their judgements is 0 to
Prospector, 2 to Defender and 11 to Reactor, giving
the respective association values of 0.000, 0.154
and 0.846. Using these sets of association values
for each state, a visual elucidation of the strategic
stances of all the 51 states is given in Figure 5, for
when the opinions of 13 experts were expressed.
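
The association values are simple proportions of expert judgements, as the following Python sketch shows for the Kentucky breakdown quoted above. The function name `association_values` and the representation of the judgements as a list of stance labels are assumptions for illustration.

```python
from collections import Counter

STANCES = ("Prospector", "Defender", "Reactor")


def association_values(ratings):
    """Share of expert judgements assigned to each strategic stance."""
    if not ratings:
        raise ValueError("at least one expert judgement is required")
    counts = Counter(ratings)
    return {stance: counts[stance] / len(ratings) for stance in STANCES}


# Kentucky: 0 Prospector, 2 Defender and 11 Reactor judgements.
ky_ratings = ["Defender"] * 2 + ["Reactor"] * 11
ky = association_values(ky_ratings)
print({stance: round(value, 3) for stance, value in ky.items()})
```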

In Figure 5, the levels of association to the 
strategic stances, Prospector, Defender and Reac- 
tor, of a state are represented as a simplex coor- 
dinate (circle) in a simplex plot (equilateral tri- 
angle). The vertices (corners) of the presented 
simplex plot denote where there is total associa- 
tion to a single strategic stance. The dashed lines 
presented inside the simplex plot partition its 
domain to where there is largest association to 
one of the strategies (indicated by the nearest 
vertex label), as well as the ordered association. 
Also present in Figure 5 are shaded regions which 
show the area where there will be at least 50% 
association to one stance (majority association to 
one stance over the other stances). 
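
A simplex coordinate of this kind can be computed as a weighted average of the triangle’s vertices. The Python sketch below is only one possible layout: the placement of Prospector at the top and Defender and Reactor at the base corners is an assumption (the figure’s actual orientation is not stated in the text), and the function names are hypothetical.

```python
import math

# Assumed vertex positions of an equilateral triangle with unit sides.
VERTICES = {
    "Prospector": (0.5, math.sqrt(3) / 2),
    "Defender": (0.0, 0.0),
    "Reactor": (1.0, 0.0),
}


def simplex_point(assoc):
    """Map three association values (summing to 1) to a 2-D plot position."""
    x = sum(assoc[s] * VERTICES[s][0] for s in VERTICES)
    y = sum(assoc[s] * VERTICES[s][1] for s in VERTICES)
    return x, y


def majority_stance(assoc):
    """Stance holding at least 50% association, or None if there is none."""
    for stance, value in assoc.items():
        if value >= 0.5:
            return stance
    return None


ky = {"Prospector": 0 / 13, "Defender": 2 / 13, "Reactor": 11 / 13}
print(simplex_point(ky), majority_stance(ky))
```

A state with total association to one stance lands exactly on that stance’s vertex, and a state with equal association to all three lands at the centre of the triangle, outside every majority region.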

The results in Figure 5 show the majority of
the states have strategic stances which are part
Prospector, Defender and Reactor. Interestingly,
there is a dearth of states with mostly Prospector
and Reactor stance associations; instead there
is a predominance of associations including the
Defender stance. Indeed, this predominance of
the Defender stance is shown by the breakdown
of the states to their most dominant (largest)
stance association; with 13 expert opinions: 14
Prospector, 28 Defender and 9 Reactor. In the
case of ‘50% plus majority’ association (states in
shaded regions), the breakdown is: 13 Prospector,
25 Defender and 9 Reactor (four states not in
shaded regions).

Figure 5. Association details of strategic stances
of US states, using 13 experts’ opinions

In this study, eight state LTC characteristics
were considered (see Table 4). The first two
measures (innovative programmes and innovative
policies) were created specifically for this analysis.
Both measures assign a score to each state based
on the number of innovative HCBS initiatives
(programs and policies respectively) operated by
the state LTC system. The other six characteristics
are those most commonly used in previous studies
of variation in the performance of LTC systems
and include measures of need (aged population
and disability rate), service supply (nursing home
beds), state politics (senate voting records), and
state government munificence (state finances).

To enable an FDT analysis of this data, the 
fuzzification of the eight characteristics is neces- 
sary. Within an applied problem, it is important 
to undertake an understandable mechanism for 
their fuzzification (Fiss, 2007). 

The level of fuzzification of the characteristics
directly impacts on the potential size of the
developed FDT, where a low level of fuzzification
(low numbers of linguistic terms (MFs) associated
with each state characteristic) may induce a
larger FDT than if a higher level of fuzzification
is given (high numbers of linguistic terms (MFs)
associated with each characteristic). Mitra et al.
(2002) directly consider this point, highlighting
that a smaller/compact tree is more efficient both
in terms of storage and time requirements,
tends to generalize better to unknown
test cases, and leads to the generation of more
comprehensible linguistic rules.

The fuzzification of the characteristics is next
described, with the defining values
$[\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5]$
needing to be found for each characteristic. For a
characteristic, the list of attribute values was
first discretized into three groups using equal
frequency discretization (whereby each group is
made up of the same number of states (objects)).
The result of this discretization is the evaluation
of the defining values $\alpha_2$ and $\alpha_4$
(the cut points found from the discretization). The
other three defining values, $\alpha_1$, $\alpha_3$
and $\alpha_5$, are the mean values of the attribute
values in each of the groups previously identified.
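
The fuzzification step described above can be sketched in Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the cut points are assumed to be midpoints between neighbouring groups, and the MFs are assumed to be piecewise linear with a membership grade of 0.5 at each cut point.

```python
def defining_values(values):
    """Return [a1, a2, a3, a4, a5] for one characteristic (needs >= 3 values)."""
    ordered = sorted(values)
    n = len(ordered)
    # Equal-frequency split into three groups (sizes differ by at most one).
    first, second = n // 3, (2 * n) // 3
    groups = [ordered[:first], ordered[first:second], ordered[second:]]
    # Assumption: a cut point is the midpoint between neighbouring groups.
    a2 = (groups[0][-1] + groups[1][0]) / 2
    a4 = (groups[1][-1] + groups[2][0]) / 2
    # The remaining defining values are the group means.
    a1, a3, a5 = (sum(g) / len(g) for g in groups)
    return [a1, a2, a3, a4, a5]


def interpolate(x, points):
    """Piecewise-linear interpolation through (x, y) points, flat outside them."""
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            if x1 == x0:
                return y1
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def memberships(x, a):
    """Grades of membership of x to Low, Medium and High (assumed MF shapes)."""
    a1, a2, a3, a4, a5 = a
    return {
        "Low": interpolate(x, [(a1, 1.0), (a2, 0.5), (a3, 0.0)]),
        "Medium": interpolate(
            x, [(a1, 0.0), (a2, 0.5), (a3, 1.0), (a4, 0.5), (a5, 0.0)]
        ),
        "High": interpolate(x, [(a3, 0.0), (a4, 0.5), (a5, 1.0)]),
    }
```

Given the grades `m = memberships(x, a)`, the majority linguistic term is `max(m, key=m.get)`.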

Based on this fuzzification process, the MFs
associated with the linguistic terms for a linguistic
variable form of the numerical characteristics can
be constructed, see Figure 6.

In Figure 6, the sets of defining values are
shown which describe the MFs defining the
concomitant linguistic terms, here termed Low,
Medium and High (forming a linguistic variable),
for each of the eight characteristics. The explicit
interpretation of these terms needs to be taken in
the context of the ordinal and continuous state
characteristics. An example of the impact of this
fuzzification process is given for two states,
namely KY and MN, see Table 5.

Table 4. State characteristics, measures, and sources

Innovative LTC Programs: Combined score from 1 point each for: Money Follows the Person, Cash & Counselling, Better Jobs Better Care, National Governors Association Research grants, Demonstration projects, Medicaid state Plan personal care optional program, Medicaid Alzheimer’s programs, state-only funded programs, Medicaid waivers

Innovative LTC Policies: Combined score from 1 point each for: generous eligibility on Medicaid HCBS waiver programs; lower than average waiting list on waiver program; existence of formal Olmstead Plan; CON/moratorium on nursing home expansion; nursing home complaints; % long-term care; operating Medically needy eligibility policy; value of real choice systems change grants. Score 0-9

Liberal state politics: Americans for Democratic Action, index of state senators’ liberal voting records

State finances: State government (revenue - expenditure) + Debt

State Wealth: Income per capita

Institutional Bed Supply: Nursing home beds per 1,000 population

Figure 6. Membership functions of the linguistic terms, describing the linguistic variable forms of the eight characteristics describing the states

Table 5. Characteristic values and their fuzzification with majority linguistic term presented for KY and MN

In Table 5, each characteristic value is fuzzified
into three fuzzy values denoting its grade of
membership to each of the linguistic terms that
define the associated linguistic variable. The bold
value in a set of fuzzy values is the largest in that
set, and identifies the linguistic term with major
support in describing the state. For example, in the
case of the state characteristic C1, with values 2
and 7 for the states KY and MN respectively, the
fuzzy values can be found from inspection of
Figure 6a; the largest of them, 0.788 and 1.000,
offer major support to the states being associated
with Low C1 for KY and High C1 for MN.

It is the series of fuzzy values representing
the state characteristics, together with the stance
associations presented in Figure 5, that form the
data set (fuzzy data set) from which an FDT
analysis can be undertaken. The start of this FDT
analysis is next briefly described. A threshold
value of $\beta = 0.700$ was used throughout this
FDT construction process, as the level of
subsethood required for a node to be designated a
leaf node (see later).

Following the FDT analysis given on the small
example data set earlier, the FDT analysis starts
with the evaluation of the root node, which is
associated with the characteristic with the lowest
G(·) value, see Table 6.

Table 6 shows G(C1) = 0.636 is the lowest of
the presented values, indicating the characteristic
C1 offers the least classification ambiguity when
its linguistic term values are considered with
the levels of association each state has towards
the three strategic stances describing their LTC.
It follows that the characteristic C1 forms the root
node of the intended FDT, with a path created
for each linguistic term (Low, Medium and High),
and the related subsethood values for each of these
paths presented in Table 7.
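
The root-node selection above can be sketched as follows. The formulas used here, fuzzy subsethood S(A, B) = Σ min(μ_A, μ_B) / Σ μ_A and a nonspecificity-based ambiguity measure, follow the commonly cited Yuan and Shaw (1995) formulation of fuzzy decision trees; they are assumed to match the measures defined earlier in the chapter. The function names are hypothetical, and treating an all-zero possibility vector as maximally ambiguous is an additional assumption.

```python
import math


def subsethood(a, b):
    """Degree to which fuzzy set a is a subset of fuzzy set b."""
    total = sum(a)
    if total == 0:
        return 0.0
    return sum(min(x, y) for x, y in zip(a, b)) / total


def ambiguity(possibilities):
    """Nonspecificity of a possibility distribution over the classes."""
    top = max(possibilities)
    if top == 0:
        # Assumption: no evidence at all is treated as maximally ambiguous.
        return math.log(len(possibilities))
    ordered = sorted((p / top for p in possibilities), reverse=True)
    ordered.append(0.0)
    return sum(
        (ordered[i] - ordered[i + 1]) * math.log(i + 1)
        for i in range(len(possibilities))
    )


def classification_ambiguity(terms, classes):
    """Weighted ambiguity G of one characteristic.

    terms:   {linguistic term: membership grades over all objects}
    classes: {decision class:  membership grades over all objects}
    """
    weights = {term: sum(mu) for term, mu in terms.items()}
    total = sum(weights.values())
    score = 0.0
    for term, mu in terms.items():
        poss = [subsethood(mu, c) for c in classes.values()]
        score += weights[term] / total * ambiguity(poss)
    return score


def choose_root(characteristics, classes):
    """Return the characteristic with the lowest classification ambiguity."""
    return min(
        characteristics,
        key=lambda name: classification_ambiguity(characteristics[name], classes),
    )
```

In the study this selection is what places C1, with G(C1) = 0.636, at the root of the tree.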

In each row in Table 7, the largest value
identifies the strategic stance each of the paths from
the root node is most associated with. Amongst the
largest values for the three paths, highlighted in
bold, only that associated with C1 = High (to
Prospector) is above the minimum truth level of
0.7 ($\beta$) desired, so that path ends at a leaf node.
For the other two paths (C1 = Low and C1 =
Medium), a check is required to see if further
augmentation of the path with other state
characteristics will continue the reduction in
classification ambiguity.

Table 6. Classification ambiguity values (G(·)) associated with fuzzy forms (linguistic variables) of the state characteristics, C1, ..., C8

Table 7. Subsethood values of C1 paths to strategy stances, Prospector, Defender and Reactor

This construction process is continued until
each path results in a leaf node, either because the
required level of subsethood has been achieved or
because no further augmentation with other
characteristics is appropriate. The resultant FDT is
presented in Figure 7.
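
The leaf-node check applied to each path can be sketched as below. This is an illustrative sketch only: the function names and the toy membership grades are hypothetical, the conditions along a path are assumed to be combined with the minimum operator, and subsethood is assumed to be S(A, B) = Σ min(μ_A, μ_B) / Σ μ_A.

```python
def subsethood(a, b):
    """Degree to which fuzzy set a is a subset of fuzzy set b."""
    total = sum(a)
    if total == 0:
        return 0.0
    return sum(min(x, y) for x, y in zip(a, b)) / total


def path_membership(*conditions):
    """Combine the conditions along a path (assumed fuzzy AND = minimum)."""
    return [min(grades) for grades in zip(*conditions)]


def evaluate_path(path, classes, beta=0.7):
    """Return (best class, its truth level, whether the path is a leaf)."""
    truth = {name: subsethood(path, mu) for name, mu in classes.items()}
    best = max(truth, key=truth.get)
    return best, truth[best], truth[best] >= beta


# Hypothetical grades for three objects, for illustration only.
path = path_membership([1.0, 0.8, 0.2], [0.9, 0.3, 0.6])
classes = {"Prospector": [1.0, 0.0, 0.0], "Defender": [0.0, 1.0, 1.0]}
best, truth_level, is_leaf = evaluate_path(path, classes)
```

In this toy case the largest truth level falls below 0.7, so the path would be a candidate for further augmentation rather than a leaf.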

There are nine leaf nodes in the FDT, each of
which has a defined fuzzy decision rule associated
with it. One of the rules, R6*, is marked with a *
since the largest truth value associated with this
rule is less than the 0.7 truth threshold value
imposed. In the case of this path of the FDT, there
was no further ability to augment it with other
nodes to improve the subsequent classification
ambiguity. By their very nature, the fuzzy decision
rules are readable (interpretable); for example,
the three rules R1, R4 and R9 can be written as:


R1: “If C1, C7 and C3 are Low then LTC
Strategic Stance of a state is Prospector (0.059),
Defender (0.428) and Reactor (0.810)”


R4: “If C1 is Low and C7 is Medium then LTC
Strategic Stance of a state is Prospector (0.248),
Defender (0.907) and Reactor (0.571)”


R9: “If C1 is High then LTC Strategic Stance
of a state is Prospector (0.731), Defender
(0.449) and Reactor (0.258)”


In Figure 5, the relative association of the
states to the three strategic stances was exposited.
In Figure 8 the relative associations of the fuzzy
decision rules are considered. To achieve these
relative associations, for each fuzzy decision rule,
the three subsethood values to the three strategic
stances are normalised so they sum to one,
allowing their representation as simplex
coordinates in a simplex plot.

In each simplex plot shown in Figure 8, a fuzzy
decision rule is shown (represented as a star),
along with the states that satisfy the conditions
of that rule (using the simplex coordinates
presented in Figure 5).

The fuzzy decision rule R1 is considered to
exposit these results. It reports truth levels
to Prospector, Defender and Reactor of 0.059,
0.428 and 0.810 (normalized to 0.045, 0.330 and
0.625), with its largest association to the Reactor
strategic stance. Further inspection of the simplex
plot covering the rule R1 shows that three out of the
five states whose majority linguistic terms for
each state characteristic satisfy the conditions of
this rule are also most associated with the Reactor
stance. While this is not a stipulation, all five
states are roughly clustered in the top right hand
region of the simplex plot. One of these states
is KY, whose fuzzified state characteristics were
reported in Table 5, where the levels of strategic
stance associations are given. Similar clusters
of states are shown associated with each fuzzy
decision rule.

Figure 7. FDT for US states’ health service strategy positions

FUTURE TRENDS 


While fuzzy set theory (FST) is well known in
many fields of study, it has had limited impact in
the area of organizational and policy research. It
is not surprising then that the fuzzy decision tree
(FDT) approach has not yet been formally applied
in this area. With the resultant fuzzy ‘If ... then ...’
decision rules being readable and interpretable,
it offers a novel way forward to gain inference
in this area.

It will be interesting to note how FDT and
other alternative fuzzy approaches can be
employed within the fields of organization and
policy research in the future. In the case of FDT
and other FST based techniques, how pertinent
their application is will depend greatly on how
acceptable the readability and interpretability
associated with FST techniques are.


CONCLUSION 


The recent discussion of set-theoretic approaches
in organization research, with the inclusion of
the understanding of fuzzy set theory (FST), is
a demonstration of the potential for FST based
approaches to be further employed in organizational
and policy research. The detailed discussion
and analysis in this chapter provide a
concrete example of one way in which FST can
usefully be employed in both organizational and
policy research.

The fuzzy decision tree (FDT) approach described
is, of course, only one of a number of approaches
that operate within a fuzzy environment.
As this chapter demonstrates, however, the FDT is
an approach to data mining that offers considerable
potential to organizational and policy researchers,
as it brings together a relatively well known
technique, the crisp ‘original’ form of decision
trees, and its development in a fuzzy environment.

Figure 8. Simplex plots exhibiting decision rules’ position with the three strategic stances

KEY TERMS AND DEFINITIONS


Condition Attribute: An attribute that describes
an object. Within a decision tree it is part
of a non-leaf node, so performs as an antecedent
in the decision rules used for the final classification
of an object.

Decision Attribute: An attribute that characterises
an object. Within a decision tree it is part of
a leaf node, so performs as a consequent in the
decision rules, from the paths down the tree to
the leaf node.

Decision Tree: A tree-like structure for repre- 
senting a collection of hierarchical decision rules 
that lead to a class or value, starting from a root 
node ending in a series of leaf nodes. 

Induction: A technique that infers generaliza- 
tions from the information in the data. 

Leaf Node: A node not further split, the termi- 
nal grouping, in a classification or decision tree. 

Linguistic Term: One of a set of linguistic 
terms, which are subjective categories for a lin- 
guistic variable, each described by a membership 
function. 

Linguistic Variable: A variable made up of a 
number of words (linguistic terms) with associated 
degrees of membership. 

Path: A path down the tree from root node to 
leaf node, also termed a branch. 

Membership Function: A function that quan- 
tifies the grade of membership of a variable to a 
linguistic term. 

Node: A junction point down a path in a deci- 
sion tree that describes a condition in an if-then 
decision rule. From a node, the current path may 
separate into two or more paths. 

Root Node: The node at the top of a decision
tree, from which all paths originate and lead to
a leaf node.

ABSTRACT


The aim of this research was to study the performance of 58 Slovenian administrative districts (state
government offices at local level), to identify the factors that affect the performance, and how these
effects interact. The main idea was to analyze the available statistical data relevant to the performance of
the administrative districts with machine learning tools for data mining, and to extract from available
data clear relations between various parameters of administrative districts and their performance. The
authors introduced the concept of basic unit of administrative service, which enables the measurement
of an administrative district's performance. The main data mining tool used in this study was the method
of regression tree induction. This method can handle numeric and discrete data, and has the benefit of
providing clear insight into the relations between the parameters in the system, thereby facilitating the
interpretation of the results of data mining. The authors investigated various relations between the
parameters in their domain, for example, how the performance of an administrative district depends on the
trends in the number of applications, employees’ level of professional qualification, etc. In the chapter,
they report on a variety of (occasionally surprising) findings extracted from the data, and discuss how
these findings can be used to improve decisions in managing administrative districts.


DOI: 10.4018/978-1-60566-906-9.ch004


INTRODUCTION


The aim of this research was to assess the
performance of 58 Slovenian administrative districts
(state government offices at local level), which
provide administrative services for eight state
ministries. These administrative tasks are commonly
organised in four departments of each administrative
district. The research covered only one of them, the
departments for environment and spatial planning,
whose task is to issue various permits (planning
permits, building permits and others) upon
application, under the laws and supervision of the
Ministry for environment and spatial planning. The
following three hypotheses were set at the beginning
of the research:


• The administrative districts have very different
productivity.

• The level of employee education has the
major influence on productivity.

• The increased number of applications results
in longer times for processing.


The analysis showed several findings of interest.
Among them, it was found that the organizational
productivity among administrative districts
varied enormously, up to a ratio of 10:1. Also,
the number of new applications plays a major
role in predicting the future trends in productivity.
The level of education of employees, and to a lesser
degree their age and gender, also influence
productivity.

In our experience, machine learning methods
proved to be a very efficient tool for quick,
automatic and holistic analysis of large sets of
different data. They were especially effective at
exposing the most characteristic patterns of
behavior. According to our experience in this study,
analysis with classical statistical methods is much
more rigid and more costly, in that it requires more
time to recognize various hidden patterns of
behavior such as the ones generated by machine
learning methods. In this sense, machine learning is
particularly good at the data exploration stage, when
hypotheses are formulated. Of course, when we get
to the question of proving the statistical
significance of hypotheses, we face essentially the
same problems as in classical statistics.


BACKGROUND 


Among practitioners, there is an unfavourable and
prevailing general opinion (shared also by
professionals and politicians) that work performance
in administrative services cannot be measured.
The authors of this paper and an emerging group
of innovative public managers, who participated
in data gathering and discussions during the
present research, have organized a “committee for
quality”. Members of this committee do not share
this opinion and believe that the performance of
these services can gradually be more systematically
measured and managed, very much like
those in the private sector (Asbjorn, 1995). This
committee was a facilitator of new ideas in this
respect.

The main idea was to analyse the available
statistical data (Annual reports of Administrative
statistics 1996-1999) relevant to the performance
of the administrative districts with tools of machine
learning, to obtain clear relations between various
parameters of administrative districts and their
performance. These machine learning tools include
those that are usually employed in data mining.
An additional objective was to set the requirements
for a better performance measurement system and
to suggest the need for new ways of decision making
by public managers in the fields of strategic
planning and performance management (including
performance based budgeting and performance
based pay).

The main data analysis tool used in this study
was the method of regression trees, one of the more
common machine learning techniques (Witten &
Frank 2005). We will describe this technique in
more detail in Section 3.


Developing and Organizing 
Data for Analysis 


All 58 administrative districts, which employ 
more than 3000 administrative workers, provide 
administration services at local level for eight state 
ministries. The performance of these districts is 
not properly measured, monitored and thus not 
well managed. 

Data sets that were used for the analysis were
gathered for the period of four years, 1996-99,
from various sources (Administrative statistics
and quarterly reports by the Ministry of Interior,
as well as specially designed questionnaires,
interviews, etc.).


Structuring the Domain 
of Investigation 


The first step in the development of a data base 
for the analysis was to classify and standard- 
ize various different administrative services as 
much as possible. The existing official quarterly 
reports and administrative statistics regarded and 
presented various administrative services from all 
departments as equally demanding, which is far 
from reality. If we want to exercise any serious 
performance assessment, we must first establish 
the common denominator for these various ser- 
vices. In this study, one of the authors (Z.P.) carried 
out this task by conducting several meetings with 
representative experts from the departments for 
environment and spatial planning (which was the 
focus of the research). It was agreed to classify the 
administrative services into five groups. 

Also, the time needed to process different
types of applications was defined by taking the
average time estimates by these experts. On this
basis, the relative ratios were calculated, taking
as the base the group of services which demanded
the least time (the basic service). By multiplying
the number of each administrative service by its
ratio, all the services provided can be expressed in
units of basic services, which made it possible to
compare the productivity between all five groups of
administrative services. These ratios per one
service were:


Planning permit 1.99 (32.4 hrs)

Building permit 1.77 (28.8 hrs)

Permit for use 2.26 (36.8 hrs)

Registration of construction work 1.00 (16.3 hrs), the basic unit of service

Other permits 1.34 (21.8 hrs)
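
The conversion into units of basic service can be reproduced with a few lines of Python. The hour estimates are those listed above; the function name and the example case counts in the usage note are hypothetical.

```python
# Expert time estimates per service, in hours (from the list above).
HOURS = {
    "Planning permit": 32.4,
    "Building permit": 28.8,
    "Permit for use": 36.8,
    "Registration of construction work": 16.3,
    "Other permits": 21.8,
}

# The least demanding service is the basic unit of service.
BASE_HOURS = min(HOURS.values())

# Ratio of each service to the basic service, rounded as in the text.
RATIOS = {name: round(hours / BASE_HOURS, 2) for name, hours in HOURS.items()}


def basic_units(solved_cases):
    """Express a district's solved cases in units of basic service."""
    return sum(RATIOS[name] * count for name, count in solved_cases.items())
```

For example, `basic_units({"Planning permit": 100, "Registration of construction work": 50})` weights the 100 planning permits by 1.99 and adds the 50 registrations unchanged.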




Data Collection and Cleaning 


The next step was to gather as much data as
possible about all the attributes (parameters)
that may affect, or are in any way related to, the
performance of administrative districts.

To express the average and each single
organization’s productivity in costs per unit of
various administrative services (which is best
understood by the customers, i.e. taxpayers, and
should be even more so by budget authorities),
the approximate costs were estimated at 20,000
EUR per year per worker (approximately one half
of this amount represents the salary and the other
half material costs).

The next task was to “clean” the data from official
administrative statistics by eliminating errors
(noise) from the existing data. The most common
errors were found where data was extracted from
quarterly reports: the numbers of unsolved
cases at the end of each quarter were simply added
together and presented in annual reports (instead
of using the number at the end of the fourth
quarter only).
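
The correction for this error is simple to state in code. The sketch below is illustrative only: the function names and the quarterly figures in the usage note are hypothetical, and it assumes the four end-of-quarter counts are available for each district.

```python
def annual_unsolved(quarterly_unsolved):
    """Unsolved cases at year end: the count at the end of the fourth quarter."""
    if len(quarterly_unsolved) != 4:
        raise ValueError("expected four end-of-quarter counts")
    return quarterly_unsolved[-1]


def looks_summed(reported_annual, quarterly_unsolved):
    """Flag an annual figure that equals the sum of the quarterly counts."""
    return (
        reported_annual == sum(quarterly_unsolved)
        and reported_annual != quarterly_unsolved[-1]
    )
```

With hypothetical counts `[40, 35, 50, 45]`, a reported annual figure of 170 would be flagged and replaced by 45.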

The final data sets that were used for the 
analysis were organized in electronic spreadsheets 
and included the following groups of attributes: 


1. Annual structure and number of different
classified administrative services (solved
cases per year = “production” per year in
% for four year period)

2. Annual “production” expressed in units of
basic services

3. “Production” cost per basic service for four
year period

4. Number of new applications and unsolved
cases per year for four year period

5. Number of employees, age, gender, level of
education for four year period

6. Presence of performance measurement and
stimulation measures for four year period

7. Population served by administrative unit

8. Various other derived indexes (trends) and
some other performance indicators (such as
denied applications or overturned decisions
at a higher level)


Table 1. Average time spent for administrative services (productivity) in years 1996-99


Then a table of classified administrative
services was constructed. The rows in the table
correspond to the 58 administrative districts. For
each row in the table, there were four groups of
columns corresponding to the four years 1996-1999.
Each group contained columns for the 5 different
classified services (A-E). Knowing the share of
different services of each administrative district
(from the questionnaires) in the four years and the
number of employees, we were able to calculate the
average production time and cost for all services
in the given year.

Nearly all the data sets needed some quality
improvement, which is quite a usual experience
(see Table 1).

The experts from the administrative districts
gave their estimated average (standard) time of
16.30 hours per basic unit of service, while these
calculations showed this to have been a gross
overestimate by a factor of 2.466 (the actual
average time for a basic unit was found to be only
6.61 hours). This adds to the conclusion that the
understanding of the work processes in
administrative districts is very weak.
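
The arithmetic behind these figures can be sketched as follows. The cost of 20,000 EUR per worker per year and the two average times come from the text; the function names and the `hours_per_year` parameter (annual working hours per employee, which the text does not state) are assumptions for illustration.

```python
COST_PER_WORKER_EUR = 20_000  # estimated yearly cost per worker (from the text)


def cost_per_basic_unit(employees, basic_units):
    """Yearly cost divided by yearly production in basic units."""
    return employees * COST_PER_WORKER_EUR / basic_units


def hours_per_basic_unit(employees, basic_units, hours_per_year):
    """Average time per basic unit; hours_per_year is an assumed input."""
    return employees * hours_per_year / basic_units


# Ratio between the experts' estimate and the calculated average time.
factor = 16.30 / 6.61
print(round(factor, 3))  # 2.466
```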
DATA MINING METHOD
USED IN THE ANALYSIS


For this study we have chosen to use Machine 
Learning (ML) methods (Witten & Frank 2005; 
Mitchell 1997; Weiss & Kulikowski 1991) for data 
analysis. We could have chosen for our analysis 
more traditional, classical statistical methods. 
However, we found ML tools more appropriate for 
this study. The reason is that our primary interest 
was exploratory data analysis, where we try to 
extract relations, possibly tentative and largely 
speculative, from the data. These relations that 
result from ML-based analysis should be under- 
stood as hypotheses, not necessarily as definite, 
statistically significant findings. 

Generating hypotheses is harder to do in the
classical statistical framework than with ML
techniques because the latter are more flexible in
enabling automatic generation (not only testing)
of hypotheses. For example, classical numerical
regression methods are usually limited to linear
regression functions, whereas regression tree
techniques in ML typically result in non-linear
models. Here we may also consider the so-called
model trees with linear regression in the leaves.
Therefore, regression trees are a much more
expressive hypothesis language than the one used
by classical numerical regression. The class of
models generated by regression tree techniques
is much larger than that of linear regression.
Classical statistical methods, on the other hand, may
have an advantage in that they are easier to subject
to rigorous statistical testing of the significance
of hypotheses. Simply, the much larger multiplicity
of possible hypotheses in regression trees makes it
harder to avoid overfitting of hypotheses to data,
and to evaluate the hypotheses’ statistical
significance.

An approach to handling the multiplicity of
hypotheses considered in the search for a model,
called EVC (extreme value correction), was
developed by Mozina et al. (2006), but applying this
technique in ML is quite complicated.
It requires an approximation of the distribution
of accuracy over the (usually large) set of trees
considered by a tree induction algorithm. By way
of a summary, ML is more flexible and effective
for generating material for further interpretation
and research. The price for this greater flexibility
is, of course, that there is a danger of
misinterpreting the results of ML. The results should
usually be interpreted very cautiously.

As will become clear later, learning to make
accurate numerical predictions is very hard in our
domain (many attributes and a small number of
examples); therefore the ability of ML techniques to
generate more versatile models is more valuable
than numerical prediction accuracy. ML models
offer more ways of meaningful interpretation of
the relations in the domain. They enable the expert
to explore the domain, consider patterns that
appear in induced hypotheses, and gain unexpected
insights about the domain.

The particular ML technique used in our study
is the learning of regression trees from data
(Breiman et al. 1986; Witten & Frank, 2005). In most
of our analyses we used an early implementation
of regression tree learning, RETIS, by A. Karalić
(Karalić 1991). A more recent implementation
of regression tree learning is part of the Orange
system for ML (Demsar, Zupan & Leban, 2004). A
particularly powerful practical feature of Orange
is the intelligent data visualization module that
can be used for initial “visual” analysis of the data
at the stage where the “data miner” develops the
“feel” for the data (Leban et al., 2006).

The choice of regression trees among many 
other ML techniques was in our case justified by the 
following properties of our data mining problem: 


1. the class (or “target variable”, i.e. the vari- 
able that we want to predict) was given: the 
productivity of an administrative district 
measured as the cost of basic unit of service; 
this fact makes some other data mining ap- 
proaches, such as unsupervised learning (e.g. 
inducing association rules) less appropriate; 

2. the target variable was numerical; therefore
classification learning methods (such
as decision trees, or rules) would be less
appropriate because they would require at
least the discretization of the numerical class
variable;

3. we were interested in detecting complex, 
non-linear patterns, which is handled by 
regression trees, but not with, say, multivari- 
ate linear regression. 


Of course there are a number of ML techniques
that we considered as alternatives to regression
trees. These alternatives include decision trees
(which would require the discretization of the
target variable), if-then rules with either
numerical predictions or, again, a discretized target
variable, neural networks, variants of Support
Vector Machines, etc. As explained above,
the choice of regression trees fits the properties
of our data mining problem most directly and
naturally. Discretization of the target variable,
required by decision trees, would either lead to
loss of information on the order of discrete values,
or to the difficulty of non-standard decision tree
learning with an ordinal class. Our major requirement
regarding the data mining technique was also that
the results of learning be easy for a human expert
to understand and interpret. This makes methods
like neural networks less appropriate. We could
here continue to analyze in depth various aspects
of the choice of the data mining method, and find
that some alternatives would be quite viable and
would lead to similar general findings as regression
trees. But the main contribution of this paper
is the analysis of public administrative services
data; therefore we will instead concentrate
more on this application.

In the following we describe more precisely
how our data mining problem was formulated as a
regression tree learning task. Induction of regression
trees can be viewed as a method for automatic
knowledge acquisition from data. Regularities that
are present in the analyzed data are automatically
extracted from the data. In the context of machine
learning, the data is interpreted as examples drawn
from the domain of investigation, and the true
regularities in the domain are hopefully extracted
from the examples. In the case of learning regression
trees, the data has the form of a table, and the
extracted regularities are represented in the form
of a regression tree. Each row of the data table is
viewed as an example for learning.

In our case study, each administrative district
represents an example for learning. The columns of
the data table correspond to the attributes that are
used to describe the examples. In our case study,
the number of employees is one of the attributes.
There is a selected distinguished attribute, called
the target variable, or class. A learned regression
tree extracted from the data specifies the mapping
from the attributes to the class. More formally,
let the class be $y$, and the attributes be
$x_1, x_2, \ldots, x_n$. Then the learned regression
tree defines a function $y = f(x_1, x_2, \ldots, x_n)$.
The attributes can be discrete (nominal) or
continuous (numerical). In the case of regression
trees, the class is continuous. In our analysis, the
class is typically the average time in an
administrative district needed to accomplish the
basic administrative task (cost of basic service).
This is a continuous variable. This choice of class
attribute is the most natural because in our analysis
we are most interested in the district’s performance
depending on the attributes of an administrative
district.

Inaregression tree, there are selected attributes 
assigned to the internal nodes of the tree. The 
branches stemming from an internal node cor- 
respond to possible values of the attribute at the 
node. Class values, i.e. numbers, are assigned to 
the leafnodes of the tree. A regression tree learning 
algorithm constructs such a tree from the given 
data table in such a way that the tree minimizes 
the expected prediction error when predicting the 
class value given the attribute values. The algo- 
rithm attempts to select the most “informative” 
attributes and inserts them at the highest levels 
in the tree. The most “informative” attributes are 
those that have the highest influence on the class 
value, so they are the most important in predict- 
ing the class. 
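
The notion of a most “informative” attribute can be made concrete with a small sketch. The text does not state which criterion the learning tool used, so the code below assumes a common one: choose the attribute and threshold whose split leaves the smallest sum of squared errors around the mean class value on each side. The function names, attribute names and data values are hypothetical and serve only as illustration.

```python
from statistics import mean


def sse(ys):
    """Sum of squared deviations of the class values from their mean."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(rows, attributes, target):
    """Return (error, attribute, threshold) of the binary split with the lowest error.

    Assumed criterion (not confirmed by the source): minimise the total
    squared error of the class values on the two sides of the split.
    """
    best = None
    for attribute in attributes:
        values = sorted({row[attribute] for row in rows})
        # Candidate thresholds lie halfway between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r[target] for r in rows if r[attribute] <= threshold]
            right = [r[target] for r in rows if r[attribute] > threshold]
            error = sse(left) + sse(right)
            if best is None or error < best[0]:
                best = (error, attribute, threshold)
    return best


# Hypothetical toy data: cost depends on the number of employees, not on age.
toy_rows = [
    {"employees": 5, "age": 40, "cost": 40.0},
    {"employees": 6, "age": 30, "cost": 40.0},
    {"employees": 10, "age": 41, "cost": 20.0},
    {"employees": 12, "age": 31, "cost": 20.0},
]
chosen = best_split(toy_rows, ["employees", "age"], "cost")
```

In this toy table the split on the number of employees separates the two cost levels perfectly, so it would be placed at the root; applying the same selection recursively to each side would grow the rest of the tree.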

A regression tree is used for prediction of the 
class value for a given new case as follows. We 
start at the root node and consider the attribute at 
the root. We then proceed down the tree along the 
branch that corresponds to the new case’s value 
of this attribute, which leads to the next internal 
node. At the next node, we repeat the same action. 
We thus progress downwards along a path of the
tree until a leaf node is encountered. The class 
value assigned to that leaf is our predicted class 
value for the given case. We will show a number 
of examples of regression trees, along with their 
interpretation, in the section on the results of the 
analysis. 
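
The prediction procedure just described can be sketched in a few lines. The sketch assumes binary splits on numerical attributes (cases with a value at or below the threshold go left); the class names, attribute names, thresholds and leaf values are hypothetical and are not taken from the trees induced in this study.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: float  # class value stored at the leaf


@dataclass
class Node:
    attribute: str    # attribute tested at this internal node
    threshold: float  # cases with attribute <= threshold follow the left branch
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def predict(tree: Tree, case: dict) -> float:
    """Walk from the root to a leaf and return the leaf's class value."""
    node = tree
    while isinstance(node, Node):
        if case[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Hypothetical tree: the numbers are illustrative only.
toy_tree = Node(
    "employees", 8,
    Leaf(40.0),
    Node("new_applications", 1346, Leaf(30.0), Leaf(22.0)),
)
predicted = predict(toy_tree, {"employees": 10, "new_applications": 1500})
```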

In our experiments, we sometimes used a
variation of regression trees with linear regression
at the leaves, known as model trees. In such a tree,
a linear regression formula is assigned to a leaf 
node, instead of a single numerical value. This 
regression formula is then used to compute the 
class value (instead of just reading the class value 
at the leaf). This variety of regression tree learning 
can be viewed as structured linear regression. That 
is, the tree assigns separate regression formulas 
to different subspaces of the problem domain. 
The subspace covered by a regression formula
is defined by the conditions along the path from
the root to the leaf at which the formula appears.
Although model trees are an attractive possibility, 
in our study they generally turned out to be less 
useful than the usual, “point-value” trees, and 
model trees’ prediction accuracy was somewhat 
inferior. The reason for this is probably the fact 
that linear regression in the leaves was unreliable 
because of shortage of learning data. 
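
A model tree differs from a point-value tree only in what is stored at the leaf. The following minimal sketch assumes a dictionary representation with binary numerical splits; the keys, attribute names and coefficients are hypothetical and chosen purely for illustration.

```python
def predict_model_tree(tree, case):
    """Follow the splits to a leaf, then evaluate that leaf's linear formula."""
    node = tree
    while "split" in node:
        attribute, threshold = node["split"]
        node = node["low"] if case[attribute] <= threshold else node["high"]
    # A leaf holds an intercept and one coefficient per attribute it uses.
    return node["intercept"] + sum(
        coefficient * case[name]
        for name, coefficient in node["coefficients"].items()
    )


# Hypothetical model tree: a separate linear formula for each subspace.
toy_model_tree = {
    "split": ("employees", 8),
    "low": {"intercept": 50.0, "coefficients": {"new_applications": -0.01}},
    "high": {"intercept": 30.0, "coefficients": {"new_applications": -0.005}},
}
small_district = predict_model_tree(
    toy_model_tree, {"employees": 5, "new_applications": 1000}
)
large_district = predict_model_tree(
    toy_model_tree, {"employees": 12, "new_applications": 1000}
)
```

Each leaf formula has to be fitted from only those examples that reach the leaf, which is one way to see why a shortage of learning data makes the leaf regressions unreliable.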

Regression tree learning programs typically
also have a built-in simplification mechanism
called “pruning”. This mechanism prunes those
leaves of the tree, and the corresponding branches,
that have a low degree of significance. This makes
the induced trees less complex and thus easier to
comprehend, and makes the key patterns in the
trees easier to interpret. Pruning also tends
to improve the predictive accuracy of regression
trees because it largely reduces the effect of “noise”
(errors) in the learning data.
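
The idea of pruning can be illustrated with a simple sketch. The text does not specify the significance measure used by the learning tool, so the code below substitutes an assumed, simpler rule: working bottom-up, a split is replaced by a single leaf when it reduces the squared error of the class values by less than a chosen amount. The dictionary layout, the `min_gain` parameter and the data are hypothetical.

```python
from statistics import mean


def sse(ys):
    """Sum of squared deviations of the class values from their mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def is_leaf(tree):
    return "ys" in tree


def prune(tree, min_gain):
    """Bottom-up pruning: collapse splits whose error reduction is below min_gain."""
    if is_leaf(tree):
        return tree
    left = prune(tree["left"], min_gain)
    right = prune(tree["right"], min_gain)
    if is_leaf(left) and is_leaf(right):
        merged = left["ys"] + right["ys"]
        gain = sse(merged) - (sse(left["ys"]) + sse(right["ys"]))
        if gain < min_gain:
            # The split explains too little; keep one leaf with all examples.
            return {"ys": merged}
    return {"split": tree["split"], "left": left, "right": right}


# Hypothetical tree; each leaf stores the class values of its learning examples.
toy_tree = {
    "split": ("employees", 8),
    "left": {"ys": [40.0, 42.0]},
    "right": {
        "split": ("age", 40),
        "left": {"ys": [20.0, 21.0]},
        "right": {"ys": [20.5, 21.5]},
    },
}
pruned = prune(toy_tree, min_gain=5.0)
```

Here the split on age separates values that hardly differ, so it is removed, while the split on the number of employees is kept.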


PERFORMANCE ANALYSIS 
WITH REGRESSION TREES 


Studying the regression trees that resulted from the
mining of administrative districts data, we can
draw valuable conclusions about various patterns
of behavior among the 58 administrative districts
(departments).

It should be noted that, as usual in data min- 
ing, the mining was not done in “one shot”, but 
it was a complex, iterative process that involved 
many experiments where our data mining tool 
was applied with various parameter settings (e.g. 
degree of tree pruning) to different versions of 
data. In this process, several learning problems
were formulated and re-formulated, and induced
trees were interpreted by the domain expert, which
led to new ideas about problem formulations.
The data for various formulations of the learning
problem were refined or re-interpreted. For
example, subsets of attributes were pre-selected 
according to the domain expert’s (Z. P.) opinion, 


and new (derived) attributes were added, such 
as “trend attributes” that indicate the changes of 
attribute values between different years. 

One interesting question is whether further 
experimentation with other parameter settings 
and reformulations of the learning task might 
lead to predictors with higher accuracy than that 
attained in this paper. On this note, a rather firm 
conclusion of our experience in this study is that 
it is unlikely that further experiments could lead to
substantial improvements in predictive accuracy. 

Examples of induced regression trees and their 
interpretation, together with findings of interest, 
are shown on the following pages. In these experi- 
ments, we varied the class attribute and the set of 
other attributes selected in individual cases from 
the whole data table. 


Impact of Unsolved ’96 and New 
Applications ’97 on Productivity ’97


Here RETIS induced a relation between the num- 
ber of unsolved applications from the previous 
year and the current year, and the productivity 
expressed as cost of basic administrative service 
(see Figure 1). 

The generated regression tree offers the fol- 
lowing findings: 


a. The single most influential attribute is the
number of new applications in ’97, followed by
the number of employees, then the combined
number of unsolved applications from the previous
year plus new applications from the current year,
and finally the number of unsolved applications
from the previous year.

b. The cluster of 21 administrative districts in
the lower left part of the regression tree offers
an interesting picture. This is the group of
larger administrative districts with more than
8 employees in the department. The subgroup
of 13 with more than 1346 applications per
year has about 25% higher productivity than
the other 8. The further distinction in
productivity between the two subgroups is
the number of unsolved cases from the
previous year. Those in the subgroup
with the highest number of unsolved cases
show about 20% lower productivity. These
findings reveal a hidden rule: too many
unsolved cases produce a kind of “panic” effect,
which occurs when the number of “orders”
(new and old applications) exceeds an
optimal number that appears to be around
250 - 300.

c. The lowest productivity (cost of basic
service = 242.68 to 326.35 EUR) is shown by
5 administrative districts which employ 8 or
more workers and received up to 520 new
applications in 1997 (1 EUR = 239 SIT).


The Use of Data Mining for Assessing Performance of Administrative Services


Figure 1. Cost of basic service depending on the four attributes mentioned in the right top corner

Classes:
- cost per basic service 1997 (LC 97)
- number of new applications '97 (St. V 97)
- number of unsolved applications '96 + new


Relations between Productivity 
Trends and Trends of New 
Applications ’98—‘99 


Here the selected attributes and class were relevant
for studying the relation between productivity
trends and trends in new applications (Figure 2).
The findings are as follows. For the majority of the
administrative districts (43) the following seems to
hold: if the index of new applications is from 86 to
114, then the index of basic service cost is between
73 and 139. The tree hints at an inverse correlation,
with slightly greater dispersion on the cost side.
In other words, since a lower cost per basic service
means higher productivity, this pattern shows a
direct dependency between new applications and
productivity. This phenomenon is further clarified
in the graph of Figure 3, where the inverse
correlation is clearly indicated.


Productivity, Throughput and 
Education Level of Employees in 
1996 


The tree in Figure 4 shows how the cost per basic
service is affected by the percentages of cases
solved in 1 month, more than one month, and
the average education level of the employees.
The findings are:

a. The first division is made on the throughput
performance (percentage of cases solved
within one month). We would expect
the group of 21 administrative districts with
faster throughput, on the right side of the tree,
to also have higher productivity than the
group of the remaining 37 on the left side.
The machine learning analysis proves this
expectation wrong. The share of the most
productive ones (cost per unit around 20.000
SIT = 83.68 EUR) is equal in both groups
(1/3), while the least productive ones (cost
around 40.000 SIT = 167.36 EUR) have a
higher share in the group of 21 with the
fastest throughput.

b. Very interesting are the findings on the
influence of education on productivity, as we
can see from the group of 26 administrative
districts in the lower middle part of the tree.
The highest productivity (14.000 to 26.000
SIT, i.e. 20 ± 6 thousand) is achieved by 7
districts where the average level of education
is between 59 and 61 points. The 5
administrative districts with an average education
level of 58 and below attain much lower
productivity (33.000 to 39.000 SIT), with other
attributes being the same. The 14 districts (in the
low center of the tree) with the highest level of
education, 62 points or more, show the lowest
productivity (especially 4 districts that solve only
26% or less of cases in one month, with cost per
unit 25.000 to 57.000 SIT = 104.60 to 238.50
EUR). This finding presents an urgent
challenge for HRM specialists.


Figure 2. Relations between productivity trends and trends in new applications 98-99

Classes: Attributes:
- Index of new applications 98/99 (I V 8/9) - Index of cost per basic service 98/99 (I LC 8/9)
- Index of unsolved applications 97/98 (I N 7/8)
- Index of solved basic services 98/99 (I P R 8/9)


Figure 3. Indexes of new applications and cost per basic service

Graph: Relation between 98/99 indexes of new applications (series 1) and
costs per basic unit (series 2). Vertical axis: index; horizontal axis:
administrative districts (1 to 57).


Figure 4. Cost per basic service dependence on throughput and education of employees in 1996

Classes:
- Cost per basic service '96 (LC 96)

Attributes:
- % of solved cases within 1 month '96 (% R 1mes.96)
- % of solved cases within 1 month - 2 month '96 (% R 1-2mes.96)
- % of solved cases within 2 month + month '96 (% R 2+mes.96)
- Average education level of employees '96 (st. izobr.)


Productivity, Throughput and
Average Education Level in 1999


The analysis of the previous section was here 
repeated for the year 1999 (see Figure 5). The 
findings are similar to those of year 1996: 


a. This tree also “tells” the same sad story
about the absence of HRM.

b. The first level of division is the same as for
the year 1996: on the basis of 66% of cases
solved in 1 month. Only 14 districts meet this
criterion (in 1996, 21 districts). This reflects
a decline in performance.

c. On the lower left side of the regression tree,
the previously found paradox appears again.
The 17 districts with an education level of 63
or less have productivity, in terms of cost,
from 18.000 to 32.000 SIT = 75.31 to 133.89
EUR per basic service, while the other 7
administrative districts with an education level of
64 or more show much lower productivity,
with costs of basic service from 29.000 to
55.000 SIT = 121.34 to 230.13 EUR. A similar
but less pronounced relation is seen for the
group of 12 administrative districts next to
the right.

d. Observing the group of 14 administrative
districts on the far right side of the tree,
we can see that the 6 administrative districts
that solve between 67% and 78% of cases within
1 month have the lowest productivity,
with costs of 41.000 to 87.000 SIT = 171.55 to
364.02 EUR, while the other 8 administrative
districts with even better throughput
show much better productivity, with
costs for basic service of 16.000 to 38.000
SIT = 66.95 to 159.00 EUR. It appears that the
throughput cycle is not directly correlated
with productivity. This relation was examined
by tracing the same data “by hand”, and
the results showed that the throughput cycle
is a very stable attribute. For example, some
administrative districts experienced a drop in
applications of 50%, which was followed by
a drop in productivity of almost the
same percentage, while the throughput
cycle remained the same. Obviously
the workers themselves regulate how many
applications they process simultaneously in
one cycle.


Accuracy of Induced 
Regression Tree Models 


In the foregoing sections, we studied regression
tree models induced from the data. We were
mainly interested in qualitative relations between
various parameters indicative of the productivity
of administrative districts. On the other hand, the
generated tree models can also be used for
numerical prediction. For example, by using such a tree
we could answer questions like: What would be
the cost of basic service in a given administrative
district if the number of new applications increased
from 600 to 700?

A regression tree answers such questions with
a numerical prediction. An important question is
how reliable the tree’s predictions are. Of course,
only predictions for new cases (not the cases
already seen among the existing data) are of true
interest. We estimated the accuracy of the trees’
numerical predictions on new cases by computing the
following frequently used measures of accuracy
of numerical prediction (Witten & Frank 2005):


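a. Mean squared error:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat{x}(i) - x(i)\bigr)^2$$

Here $\hat{x}(i)$ is the predicted class value of test
case $i$, $x(i)$ is its true value, and $N$ is the number of test cases.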
b. Relative mean squared error (RMSE).

c. Relative error (RE).


To compute these measures of accuracy as
estimates of performance on new cases, we used the
method of leave-one-out (Witten & Frank 2005).
This is a standard accuracy estimation method
used when the number of learning examples
is relatively small (in our case N=58, which is
considered small). In this method, one of the
N learning examples is excluded from the
learning set, a regression tree is computed from the
remaining (N-1) examples, and the tree is applied
to predict the class value of the excluded case as
if it were a new case. This is repeated over all
the N learning cases, each time with one of the cases
being used as a new, not yet seen case.
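
The leave-one-out procedure can be sketched as follows. The learner used here is an assumption made for the sake of a runnable example: a one-split regression tree (a “stump”) on a single numerical attribute stands in for the actual tree learning tool, and the data values are hypothetical.

```python
from statistics import mean


def fit_stump(rows):
    """Fit a one-split regression tree on (x, y) pairs; a stand-in learner."""
    best = None
    xs = sorted({x for x, _ in rows})
    for low, high in zip(xs, xs[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All x values are equal: fall back to predicting the mean.
        overall = mean(y for _, y in rows)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def leave_one_out_mse(rows, fit):
    """Estimate the MSE on new cases by excluding each example in turn."""
    squared_errors = []
    for i, (x, y) in enumerate(rows):
        training_set = rows[:i] + rows[i + 1:]  # the remaining N-1 examples
        model = fit(training_set)
        squared_errors.append((model(x) - y) ** 2)
    return mean(squared_errors)


# Hypothetical data: (number of employees, cost of basic service).
toy_rows = [(1, 10.0), (2, 10.0), (3, 10.0), (10, 50.0), (11, 50.0), (12, 50.0)]
estimate = leave_one_out_mse(toy_rows, fit_stump)
```

Any other learner with the same calling convention (a function that takes the learning examples and returns a predictor) can be passed in place of `fit_stump`.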
Figure 5. Cost per basic service dependence on throughput and education of employees in 1999

The following average values over all the
generated trees were obtained: MSE=12.61,
RMSE=30.86%, RE=23%. Although RE is not
bad, these accuracy estimates indicate that the
generated regression trees cannot really be
recommended for numerical prediction.
Therefore the use of the generated regression trees
should be mainly limited to qualitative analysis that
aims at formulating interesting hypotheses in the
domain, as we did in the foregoing sections. These
hypotheses can then be used as useful qualitative
indicators, combined with indicative quantitative
thresholds, when managing administrative
districts with the aim of improving productivity.

As already stated, the predictive accuracy of the
induced trees is hardly sufficient for making
reliable numerical predictions. Therefore the main
value of these trees is as “food for thought” for
the domain expert, as an aid to help her or him
develop an intuitive understanding of the
domain of exploration. It would be interesting to
consider the question of the statistical significance of
the hypotheses (correlations) indicated by
regression trees: is the distribution of class values in a
leaf statistically significantly different from the
distribution in the whole domain? In the case of
tree learning, however, this is not straightforward
to answer, and we avoid it because such results
are typically misleading. A statistical test is
intended essentially to answer the question: what
is the probability of obtaining the observed class
distribution in a leaf by chance when randomly
sampling from the data in the whole domain? Such
an analysis is sometimes provided in applications
of tree learning.

However, the results are usually not useful and
are often misinterpreted. The reason is that in the process
of inducing a tree from the data, the induction
algorithm considers many hypotheses in a subspace
of all possible hypotheses that have the form of a
decision or regression tree. The resulting leaf is a
result of the tree learning algorithm’s choice among
the many hypotheses involved. This choice by the
algorithm already aimed at optimizing the chances
of the statistical test succeeding. This violates a
basic assumption of the test, namely that what
is tested is a single hypothesis defined prior to
collecting observations, rather than whether
a significant hypothesis exists among the many
hypotheses considered. A way to compensate for
this multiplicity of hypotheses would be to use the
Bonferroni correction. This, however, requires at
least knowledge of the number of competing
hypotheses among which the learner selected the
final tree. This number is, however, not provided
by decision tree learning algorithms, and it is not
easy to estimate.
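
For reference, the Bonferroni correction itself is simple once the number of competing hypotheses is known: the significance level is divided by that number. The sketch below uses hypothetical function names and p-values, and simply takes the number of hypotheses to be the number of p-values supplied, which is exactly the quantity that is not available for an induced tree.

```python
def bonferroni_threshold(alpha, num_hypotheses):
    """Per-hypothesis significance level under the Bonferroni correction."""
    if num_hypotheses < 1:
        raise ValueError("at least one hypothesis is required")
    return alpha / num_hypotheses


def significant_after_correction(p_values, alpha=0.05):
    """Flag the p-values that remain significant after the correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Hypothetical p-values for three competing hypotheses.
flags = significant_after_correction([0.001, 0.02, 0.04])
```

With three hypotheses the threshold drops from 0.05 to about 0.0167, so only the first p-value remains significant. A tree learner implicitly compares far more than three candidate hypotheses, which would push the threshold much lower still.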


HOW FINDINGS MAY HELP TO 
IMPROVE DECISION MAKING 


The relative absence of scientific and analytical
research into processes in Slovenian public
administration is the main reason why their performance is
spontaneous and decision making is often mainly
political and aimed at serving the employees, not
the clients (Pečar, 2002).

Data mining and machine learning results
from this research may benefit managerial
decision making in several ways, both on the macro
and micro levels. The processes of management
on the macro level include mainly tasks of
political decision making (law, policy, strategy, etc.).
A major process of this kind is legislation, for which
the Parliament, political parties, and the
Government (mostly the Ministry of Public
Administration and the Ministry of Finance) are
responsible. These groups of decision makers need
to know more about performance measurement as an
important vehicle for New Public Management (Hood
1998, Hood & Peters 2004). On this basis, they
should promote policies to develop performance
measuring systems for better decision making
not only in administrative districts, but in public
services in general.

Many findings in this paper can directly benefit 
the processes of decision making of administra- 
tive districts in: 


• Human resource management
• Strategic planning
• Operational planning
• Performance management


Useful Findings Concerning Inputs 
of Administrative Processes 


The optimal amount of applications. The
number of applications (new and unsolved) has a very
significant correlation with productivity. To attain
the highest productivity (cost of basic service
between 50 and 120 EUR), there have to be more
than 176 new applications per year per employee,
or more than 243 new and unsolved applications
together per year per employee.

Education of employees. Higher level of 
education is generally in correlation with higher 
productivity, except for the anomaly where the 
group with the highest education shows below 
average productivity. This pattern of behavior 
should be further analyzed. A new hypothesis 
has emerged: that those with higher education 
feel that their job is more secure and the absence 
of a measuring system additionally lowers their 
motivation. 

Gender. Gender has a small influence on
productivity. In our data, the optimal percentage of
males among employees is between
6 and 24%.

Age. The influence of age on productivity is small,
up to 10%. In our data, there is an indication of
an optimal average age in the range from 39 to
42 years.


Useful Findings about the
Performance of Processes 


Productivity standards, performance
measurement. The old administrative statistics in Slovenia
neglected the huge differences between various
administrative services. This research presents the
first effort to classify time standards for various
services, as a necessary precondition for measuring
productivity. The time standards (productivity)
for various services were previously unknown.
Productivity is influenced significantly by the
number of new applications and unsolved ones
from the previous year (“orders”).

The size of administrative districts. The size
of an administrative district also influences
productivity. Productivity increases with size,
with the exception of the largest districts, where
productivity falls slightly. According to our data,
the optimal size of an administrative district is between
21.000 and 43.000 inhabitants.

Time of processing: from receiving an
application to mailing the decision. Data on “throughput”
was available only in 3 intervals: within 1
month, between 1 and 2 months, and more than 2
months. The average processing time in the
four-year period analyzed was more stable than
productivity. Nearly half of the administrative districts
solved 67% or more cases within one month, but
their productivity varied significantly. The time
of processing (throughput) is not necessarily
correlated with productivity.

Motivation for employees. 25% of
administrative districts reported practicing stimulative pay
(the law allows only a 3% share of wages as
stimulative pay). Considering this information as a binary
attribute, the machine learning analysis did not
find this attribute to influence productivity. Only
two administrative districts measure individuals’
productivity with their own standards. Overall,
stimulation is small and shows no influence on
productivity.


Useful Findings about
Outputs of Processes 


Solved cases. The rate of solved cases is strongly
correlated with the trend of “orders”. Higher
numbers of denied cases are slightly related to
higher productivity.

Decisions overturned. Our data indicates that 
higher productivity does not produce more deci- 
sions which can be overturned on a higher level 
of decision making. Thus there was no correla- 
tion detected between productivity and decisions 
overturned. 


Testing Current Hypotheses 


At the beginning of this research, three hypotheses 
were set. Concerning these hypotheses we can 
conclude the following: 

Hypothesis 1: Administrative districts have
very different efficiency (productivity) of work
processes. This hypothesis was proven correct.
Organizational efficiency (productivity) of
administrative districts varies extremely, up to a
ratio of 1:10.

Hypothesis 2: The level of employees’
education has major influence on efficiency. The
research findings show a generally positive influence
of a higher level of education on productivity,
but an interesting behavioral pattern was
found, where the highest (above average) level
of education often lowers productivity. Here
an additional hypothesis emerges (and should be
explored further): that more highly educated employees
feel their jobs are more secure than those of less
educated ones, and fully exploit this advantage.

Hypothesis 3: The increased number of
applications results in longer time for processing. The
analysis proved this hypothesis to be wrong. The
administrative districts with the highest numbers of
applications per employee show the highest level
of productivity and often the highest share of cases
solved within one month (throughput).


CONCLUDING REMARKS, 
FUTURE TRENDS AND 
SUGGESTED RESEARCH 


The early expert opinion on production time for
administrative services from the administrative
districts proved to be way off, by a factor
of about 2.5 (exactly 2.466). This supports the prevailing
conclusion that the understanding of the work
processes in administrative districts is very weak.
The findings of this research attracted the attention
of the ministries, and support for further research
was given. The next step was to define the standard
time values for most of the 350 different administrative
services provided by administrative districts.
The government also passed a law requiring every
public organization to monitor efficiency
(productivity), which should be important for
promotions and stimulative pay.

The methodology of administrative
statistics was also changed. In the first trial of computerized
analyses of all production data for a 9-month period
in 2006, the results showed a situation similar to
that in this preliminary research. Different productivity
was found in all departments. The computerized
analyses also made it possible to track the
productivity of individuals. These differences naturally
exceeded the organizational ones by far. The
government officials then decided not to reveal all the
data, but to track the results of the analyses over a longer
period and at the same time look for solutions
for workers’ mobility from one district to another.
However, the study had a positive effect on the
awareness of public managers and policy makers.

The use of data mining has proven its benefits
and should become an essential part of many
future decision support systems, not only in the private
but also in the public sector. Its main benefits are
demonstrated in quick analysis and diagnosis of
multivariable situations, and in the detection of more or
less hidden behavior patterns.

A prerequisite for using such data mining
tools is well-organized databases of
comprehensive measurements. In this way data mining and
machine learning (as an increasingly important
managerial tool) can add significant new value
to the generic task of decision making in the areas of
strategic planning and performance management.


KEY TERMS AND DEFINITIONS


Public Administration: Public administrative
bodies at different levels of government; in this
study, administrative districts (orig. “Upravne
enote”).

Administrative Services: Services provided
by public administration, in our case different
permits upon application.

Basic Unit of Administrative Services: The
least time consuming administrative service, used
in the study as the basic unit (for expressing time values
by ratios for all other administrative services).

Performance Management: Managerial
task of monitoring, understanding and improving
performance of workers.

Productivity Analysis: Studying relations
between outputs (amount of administrative
services produced) and inputs (working hours used,
and other resources); often expressed in costs of
producing a unit of output.

Data Mining: Discovering regularities, or
knowledge, from typically large sets of data using
techniques such as machine learning.

Machine Learning: An area of artificial
intelligence concerned with methods that enable
computers to improve their performance by
learning from experience, or data; an important aspect
of this, central to this paper, is the capability of
machine learning methods to discover general
laws from data.

Induction of Regression Trees: A machine
learning technique for extracting from data a
(non-linear) dependence between a numerical
“target” variable and other variables in the system.


ABSTRACT


Productivity is a key success factor in any organization. In order to improve productivity, it is necessary 
to understand how various factors affect it. The previous research has mainly focused on productivity 
analysis at macro level (e.g. nations) or in private companies. Instead, there is a lack of knowledge about 
productivity drivers in public service organizations. This study aims to scrutinize the role of various op- 
erational (micro level) factors in improving public service productivity. In particular, this study focuses 
on child day care services. First, the drivers of productivity are identified in light of the existing literature 
and of the results of workshop discussions. Second, the drivers most conducive to high productivity and 
the specific driver combinations associated with high productivity are defined by applying methods of 
data mining. The empirical data includes information on 239 day care centers of the City of Helsinki, 
Finland. According to the data mining results, the factors most conducive to high productivity are the 
following: proper use of employee resources, efficient utilization of premises, high employee competence, 
large size of day care centers, and customers with little need for additional support. 


DOI: 10.4018/978-1-60566-906-9.ch005


INTRODUCTION

Productivity improvement is high on the agenda in
many public organizations. It is necessary to improve
the productivity of public services in order to satisfy
increasing demand with limited resources.
Productivity is commonly regarded as an organization’s key
success factor. On the level of national economy,
productivity improvement has been linked to many
economic and social phenomena, such as economic
growth and high standard of living (Miller, 1984;
Sink, 1983). Research in various disciplines
typically applies different approaches in productivity
studies, for instance, national economists are more 
interested in macro level perspectives, whereas 
researchers of industrial management and busi- 
ness economics typically examine productivity 
at micro level (Käpylä et al., 2008). 

Productivity is a traditional research topic on 
which there is a rich body of literature. In a recent 
Finnish study examining the current status of pro- 
ductivity research, it was concluded that the effects 
of various factors on productivity are verified too 
rarely (Käpylä et al., 2008). Productivity effects 
are often studied at macro level, e.g. national level 
(e.g. Lambsdorff, 2003; Skans, 2008). In order to 
improve productivity by managerial means, the 
drivers of productivity must be identified. These 
drivers can be utilized for different purposes, such 
as identifying development targets. Many initia- 
tives of productivity improvement have proven 
to be inefficient and have met with resistance 
among employees due to the implementation of 
harsh decisions (e.g. job cuts) as the only means 
to improve productivity. However, many different 
factors may in practice affect productivity. These 
factors should somehow be connected to micro 
level organizational operations in order to identify 
concrete development targets (e.g. improving the 
division of labor). 

The understanding of the role and the signifi- 
cance of various productivity drivers is still in its 
infancy. This study seeks to establish what role 
various managerial (micro level) factors play in 
improving public service productivity. In this 
study, productivity improvement is examined 
from the point of view of service providers. The
research questions are the following: 


• What are the drivers of productivity?
• Which drivers are best related to high
productivity?
• Which specific driver combinations are
associated with high productivity?

In particular, this study focuses on child day
care services. The first question is answered in 
light of the literature and the results of workshop 
discussions. The second and third questions are 
examined by applying data mining methods. Since 
the specific productivity driver combinations are 
beforehand unknown, data mining methods will 
provide a relatively easy way of gaining insight 
into the relationships between these various 
drivers. The empirical data includes information 
on 239 municipal day care centers of the City of 
Helsinki, Finland. 

First, we summarize the relevant literature 
on the productivity of public services in general. 
Then follows a discussion of factors contribut- 
ing to productivity specifically in child day care 
services. The data and measures used in this study 
are presented with a brief description of the data 
analysis methods used. Finally, the results of the 
empirical examination are reported. In addition, 
conclusions (including the contribution and limita- 
tions of the study and future research suggestions) 
are presented at the end of this chapter. 


ASSUMED DRIVERS OF 
PRODUCTIVITY 


Literature on Productivity 
of Public Services 


Productivity is traditionally defined as the ratio 
between output (e.g. the quantity of services pro- 
duced) and input (e.g. the number of employees 
needed for such production) (Sink, 1983). This 
definition is interpreted slightly differently in dif- 
ferent research disciplines. According to Pritchard 
(1995), all commonly used productivity definitions
can be classified into one of three categories: 


* The economist/engineer approach, where
productivity is seen as an efficiency
measure (outputs/inputs)

* The approach where efficiency (outputs/
inputs) and effectiveness (outputs/goals)
are evaluated simultaneously

* The broad approach, which comprises
everything enabling an organization to
function better.


The wider approach, including both efficiency 
and effectiveness, is in its meaning quite close to 
the concept of performance (e.g. Kaydos, 1999). 
The service productivity literature has opted for 
a wider examination of productivity by underlin- 
ing factors such as the quality of service (Sahay, 
2005), utilization of service capacity (Grönroos & 
Ojasalo, 2004) and the role of customers in service 
provision (Martin et al., 2001). In the context of 
public services, productivity has been related to 
the cost-efficiency and quality of services (Faucett 
& Kleiner, 1994; Hodgkinson, 1999). Several 
theoretical models of service productivity can 
be found (Grönroos & Ojasalo, 2004; Johnston 
& Jones, 2004; Parasuraman, 2002). 

A typical feature of services is that they cannot
be stored (Grönroos, 1984). Services are often 
consumed simultaneously with their produc- 
tion. Therefore, the poor quality of services can 
hardly be concealed. Numerous studies have been 
conducted on the connections between quality, 
productivity and profitability. Conflicting thoughts 
have been voiced on the issue (He et al., 2007; 
Huff et al., 1996). According to Reichheld and 
Sasser (1990), reducing defects means greater 
loyalty, which is related to productivity in service 
companies. High customer satisfaction (which 
can be assumed to indicate high quality) reduces 
the need for resources since there will be less 
reworking, returns and complaints (Huff et al., 
1996; Westlund & Löthgren, 2000). On the other
hand, it can be argued that increasing customer
satisfaction means more work and costs, thereby
reducing productivity. This is a common view in
economics, for instance (Anderson et al., 1997).
It seems that the role of quality in productivity
is dependent on how the productivity concept is
understood. If the wider approach related to
efficiency and effectiveness is applied, poor quality
will obviously mean a decline in productivity.
According to Anderson et al. (1997), productivity 
and customer satisfaction are positively related to 
profitability in any industry. Based on their results 
they also concluded that simultaneous attempts 
to improve productivity and customer satisfac- 
tion while pursuing better profitability are more 
difficult in services than in manufacturing opera- 
tions. In other words, increasing productivity may 
decrease customer satisfaction and vice versa. The 
results of the research by He et al. (2007) corrobo- 
rated this. They concluded that service companies 
need to strike a balance between productivity and 
customer satisfaction improvements in order to 
achieve optimal profitability. 

Since the role of employees in the provision
of services is often visible to the customers,
factors such as employee satisfaction and employee
turnover may affect service outputs. Appelbaum
et al. (2005) identified a positive link between
low employee satisfaction and low productivity.
According to the study by Westlund and Löthgren
(2000), employee satisfaction has a positive impact
on customer satisfaction as well as on productivity.
Many factors possibly causing a decrease in
productivity have been linked to employee turn- 
over. It has been estimated that it takes at least a 
year before a new employee has adapted to a new 
task and realized his/her productivity potential 
(Kransdorff, 1995). When an employee is lost 
for some reason, it takes an average of four years 
to replace competence losses (Hall, 1992). Costs 
may be generated through numerous factors (e.g. 
interruption in work processes, recruitment and 
training) (Dess & Shaw, 2001; Sutherland, 2002). 

The competence of employees may also be 
a factor affecting productivity. According to the 
resource-based view of a firm, the internal re- 
sources of an organization such as competencies 
determine the success of the organization (Penrose, 
1995). A connection has been established between
proper human resource management practices and
improved productivity (Delaney & Huselid, 1996;
Xu et al., 2006). However, the positive impact of
competence improvement on productivity is not
straightforward. Even though it has been estimated
that an investment in intellectual capital
(which also includes other intangible resources in
addition to competence) can yield twice the benefit
compared to a similar investment in a physical
asset (Abernethy & Wyatt, 2003), the findings
of a study by Väisänen et al. (2007) suggest that
investments in intellectual capital have a positive
impact on productivity only in the long term.

Factors related to capacity utilization are fun- 
damental to productivity examination since they 
are directly linked with the output/input ratio. The 
efficient utilization of service provision resources 
is demanding, since inputs may not adapt directly 
to the fluctuation in demand (Sahay, 2005). One 
essential factor behind resource utilization appears 
to be linked to employee absenteeism. If sickness 
absences could be reduced it would be possible 
to compensate for the labour shortage caused by 
the rapid aging of the workforce in many
industrialized countries (Böckerman & Ilmakunnas,
2008). According to Miller et al. (2008), there is 
only limited literature on the impact of employee 
absenteeism on productivity. They studied the im- 
pact in schools and identified that teacher absences 
reduce productivity by having a negative effect 
on students’ grades. The effect of absenteeism on 
productivity has also been studied by Allen (1983). 
He suggested that productivity effects depend on 
the job: if it is difficult to find replacement workers
or to reassign workers from other positions, the 
productivity effects are more severe. 

It is a characteristic of service production 
that customers are often active participants. 
Consequently, a customer may affect productiv- 
ity (positively or negatively) (Ojasalo, 2003). 
Customer involvement also causes variation in 
service provision, and it may therefore be diffi- 
cult to standardize service outputs (McLaughlin 
& Coffey, 1990). A service providing unit with 
customers urgently needing special services may 
need more inputs for providing certain standard
outputs (e.g. the number of provided care days). 

In light of the literature the following factors 
may be summarized to have an effect on public 
service productivity: quality, customer satisfac- 
tion, employee satisfaction, employee turnover, 
employee competence, capacity utilization, ab- 
sences of employees and customers. In addition 
to the studies examining the factors affecting 
productivity, there is a lot of research on those 
factors without the productivity link, but focusing 
on the relationships between different productivity
drivers. Some key findings of these studies are 
now briefly presented. 

Several factors may affect service quality. 
According to Xu et al. (2006), attention must be 
paid to employee competence to ensure the desired 
level of quality performance as perceived by cus- 
tomers. Employee turnover has also been studied 
in relation to quality. Low employee satisfaction 
has frequently been mentioned as a key factor 
causing resignations (see e.g. Hom & Kinicki, 
2001; Lum et al., 1998). In the study by Hurley 
and Estelami (2007), it was found that employee 
turnover predicts customer satisfaction as effec- 
tively as employee satisfaction. High employee 
turnover may cause loss of experienced employees 
and established customer relationships which, in 
turn, may be detrimental to the customer. 

Many factors have been linked to employee 
absenteeism. There is evidence that job dissat- 
isfaction increases sickness absences (Brown & 
Sessions, 1996; Farrell & Stamm, 1988). Böcker- 
man and Ilmakunnas (2008) found evidence that 
adverse working conditions are linked with job 
dissatisfaction which, in turn, is related to sickness 
absences. They also concluded that improvement 
of working conditions is an essential way to 
reduce sickness absences. According to a study 
by Lehtonen (2007), educational activities and 
competence development may help in improving 
employee satisfaction. In the study by Steel and 
Rentsch (1995), it was found that high educational 
level indicates lower level of absenteeism. They 
also stated that good employee commitment is
linked to fewer sickness absences. Lehtonen
(2007) also identified a link between a high
educational level and a small number of sickness absences.
This may be because assignments requiring high 
education are more challenging and compelling. 
People may well work even when they are sick. 


Productivity Drivers of 
Child Day Care Services — 
Workshop Discussions 


Child day care centers account for a large propor- 
tion of the costs of public services in Finland. 
For example, at the Social Services Department 
of the City of Helsinki they account for around 
one fifth of the total costs. Therefore, it can be
argued that understanding the factors affecting
productivity in the context of child care is relevant
from the perspective of the national economy. From
the perspective of the Social Services Department,
there was a clear managerial need to gain a
deeper understanding of productivity drivers and
their role in improving productivity. This would
provide a better understanding of the performance
measures used: what is the role and relevance
of different measures in managing productivity?
It would also guide the planning of the way the
services are provided: what kinds of units are
ideal for providing the services? Initially, there were
understanding and presumptions on these issues
but a lack of knowledge based on the in-depth
analysis of various quantitative measures. From
the point of view of research on data mining, child
day care services were a suitable context. They
have by far the highest unit volume in the service
production of the Social Services Department. The
large number of units providing similar services
offers rich empirical data for the research setting
of this paper.

The productivity drivers and their assumed 
relationships in child care services were identified 
by combining the results of the literature review 
(see above) with knowledge obtained through
workshops. Three representatives of child day care
services and two employees from the financial 
department were involved in three workshops 
held in spring 2007. In addition, one of the authors 
participated in the workshops as a facilitator. 
The purpose of these workshops was to identify 
factors affecting productivity in order to support 
the productivity improvement of child day care 
services. This research setting made it possible to 
utilize the knowledge and long experience of the 
personnel working in the organization studied. 
In practice, different productivity drivers iden- 
tified through the review of the literature were 
presented and evaluated in the context of child 
day care. All the discussions were documented for 
research purposes. As a result, a figure representing
the drivers of productivity and their relationships 
was drawn. On the basis of this work the model 
for productivity drivers in child day care services 
was constructed (see Figure 1). It is used as a 
framework for the empirical part of this study. 
Figure 1 includes both direct and indirect
drivers of productivity. Employee resources, the
utilization of premises and additional support need 
are examples of factors assumed to directly affect 
the level of productivity. Indirect productivity 
drivers may affect productivity but only through 
another productivity driver. Employee satisfaction 
and the absences of children are examples of the 
indirect productivity drivers in the model. 


RESEARCH METHODS 
Data and Measures 


The empirical data was gathered from the City of 
Helsinki, Finland. More specifically, this study 
focuses on municipal child day care services. The 
child day care services are classical services with 
close interaction between a service provider and a
customer. This means that the provision of services
is very employee-intensive. Since employees are
the key input resources, it is necessary to pay
attention to human resources management. There are
both educational and caring professionals working
in day care centers. In general, the employees in
the centers are very experienced: more than 2/3
of employees have worked more than 10 years.
Another relevant characteristic of child day care
relates to the actual service process. Even though
customers (both children and their parents) clearly
have an impact on the provision of services,
the service process is rather standard: in general
there is a low level of customization that depends
on a child.

Figure 1. Assumed drivers of productivity and their relationships
[Figure: Productivity and its assumed drivers: additional support need, competence and experience of employees, customer satisfaction, employee satisfaction, employee resources, utilization of premises, absences of children, employee turnover]

There are 17 day care districts in the Helsinki
region, with a total of 279 day care centers. The 
empirical data used in this study excludes Swedish 
language day care centers (31 centers). This was 
because they are typically smaller and constitute 
a separate entity. In addition, nine centers provid- 
ing 24-hour services were excluded, because they 
are quite different in terms of costs and personnel 
structure compared to typical day care centers. 
Finally, information for the years 2006 and 2007 
related to 239 day care centers was taken for 
examination. 




The measures used in this study are summarized 
in Table 1. The main reason for choosing these 
measures for the factors identified previously (see 
Figure 1) is that they were already widely used 
in the day care services of the City of Helsinki. 
Thus, they were considered cost efficient. It was 
not considered reasonable to take new measure- 
ments for the purposes of this study, for example, 
due to the tight time schedule and limited human 
resources. However, due to the foregoing, some 
of the factors identified (i.e. employee satisfaction 
and employee turnover) could not be included in 
the empirical examination, because the authors 
were unable to obtain such data from the day care 
management. In addition, most of the information 
required needed some modification and manual 
work from both the representatives of day care 
management and the authors before it was suitable 
for analysis. Finally, the data included in the study 
contains eight possible drivers of productivity. The 
measures used in this study are discussed more 
thoroughly in the following. 

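Table 1. Measures for productivity and proposed drivers of productivity

Factor                     Measure
Productivity               Cost of calculated care day (€)
Customer satisfaction      Customer satisfaction according to a survey (scale 1-5)
Absences of employees      Percentage of days lost through sickness (%)
Working experience         Percentage of employees having worked less than 10 years within the day care center (%)
Employee competence        Employee competence according to welfare survey
Utilization of premises    Degree of utilization of premises (%)
Employee resources         Number of children per number of employees (ratio)
Absences of children       Percentage of days lost through sickness per child (%)
Additional support need    Percentage of S2 children (%)
Employee satisfaction      -
Employee turnover          -

Productivity is measured by the average cost
of a calculated care day. The measure is calculated
as follows: total costs divided by the number of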
calculated days of care (the content of which is 
defined by regulation). The weighting of the 
output is dependent on the age of the children, 
likewise the need for special care services. The 
information needed was taken from two different 
data sources. Total costs were taken from the 
AdeEko data system, and the number of days from 
the Effica/DW data system. The information is 
calculated monthly at day care center level. Mov- 
ing average values are used. It should be noted 
that a high measurement result indicates low 
productivity and vice versa. The customer satis- 
faction measure is based on a customer satisfac- 
tion survey. The survey includes 12 questions, 
answered on a Likert scale 1-5. The questionnaire
is intended for the parents of all children in day 
care in Helsinki. Data is collected at a day care 
center level every second year. Thus, in this study, 
the average result describing customer satisfaction 
with the respective day care centers was used. 
Moreover, the information on customer satisfac- 
tion in 2006 was also used for 2007. 
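
The productivity measure described above can be sketched as a simple ratio. This is a minimal illustration only: the function name and the figures are hypothetical and are not taken from the study's data, and the regulated weighting of care days is assumed to be already applied to the input.

```python
def cost_per_calculated_care_day(total_costs: float, calculated_care_days: float) -> float:
    """Average cost of a calculated care day (total costs / calculated care days).

    A higher value indicates lower productivity, and vice versa.
    """
    if calculated_care_days <= 0:
        raise ValueError("calculated_care_days must be positive")
    return total_costs / calculated_care_days


# Hypothetical figures for two centers (not from the study).
center_a = cost_per_calculated_care_day(total_costs=60_000.0, calculated_care_days=1_200.0)
center_b = cost_per_calculated_care_day(total_costs=60_000.0, calculated_care_days=1_000.0)

# center_a has the lower cost per care day, i.e. the higher productivity.
print(center_a, center_b)
```

Because the measure is a cost, any ranking of centers by productivity has to sort it in ascending order.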

The absences of employees are measured as an
average percentage of days lost through sickness
in a year. The measure is calculated as follows:
the number of sickness days divided by the total
number of working days (during a year). Personnel
having more than 60 days a year lost through
sickness were excluded from this study. This is 
the maximum number of sickness days (in a year) 
during which an employee continues to receive 
full salary. The information used in this study 
describes sickness absences at day care center 
level in 2006 and 2007 and is available in the 
Hijat information system data base. 
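
As a rough sketch of the absence measure and its exclusion rule, the calculation could look as follows. The function name, the assumption that every employee has the same number of working days in the year, and the sample figures are all hypothetical simplifications made for this example.

```python
def sickness_absence_percentage(sick_days_per_employee, working_days_per_employee, max_sick_days=60):
    """Percentage of working days lost through sickness at one center.

    Employees with more than `max_sick_days` sickness days in the year are
    excluded, following the rule described in the text. Assumes (for this
    sketch only) that all employees have the same number of working days.
    """
    included = [d for d in sick_days_per_employee if d <= max_sick_days]
    if not included:
        raise ValueError("no employees left after exclusion")
    total_working_days = working_days_per_employee * len(included)
    return 100.0 * sum(included) / total_working_days


# Hypothetical center with four employees; the one with 75 days is excluded.
pct = sickness_absence_percentage([0, 5, 10, 75], working_days_per_employee=250)
print(pct)
```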

The data needed for employees’ working
experience is available in the Hijat information system
data base. It was decided to use the percentage of
employees who have worked less than 10 years
in the day care center as a measure for working
experience. The information is available at day
care center level for both years. The measure for
employees’ competence is based on a working
welfare questionnaire. One of the main components
of the questionnaire focuses on the competence 
of employees (5 questions). Each employee esti- 
mates his or her own competence using a Likert 
scale 1-5. The average values of day care centers 
(calculated according to the individual responses) 
were used in this study for both 2006 and 2007. 
However, the day care centers with fewer than 
five responses were excluded. 

The utilization of premises is simply measured 
by the degree of utilization of premises, more 
specifically by dividing the number of children by 
the places available. The exact result is calculated
as an average degree per month. However, June 
and July are excluded from the examination as 
these months are typically holiday periods and do 
not represent the usual situation. The information 
is available in the Effica/DW data system. The 
number of children per number of employees was 
used as a measure for the utilization of employee 
resources. The information is provided by the Ef- 
fica/DW data system and calculated every month 
at the level of day care centers. 

The absences of children are measured by the
percentage of days lost through sickness per child. 
This information is gathered at day care center 
level monthly and held in the Effica/DW data 
system. In this study, the average percentage of the 
year is used. Additional support need is measured 
using the percentage of S2 children. S2 children 
refer to the children who do not speak Finnish or 
Swedish (the official native languages) as their 
first language and receive Finnish language
teaching. This information is not embedded in
the calculated day of care discussed above. An
average percentage per year is calculated in each
day care center. The information is available in
the Effica/DW data system.

In addition to the variables described above,
the data used in this study includes three other 
variables: first, a variable referring to the division 
of the day care center (a division may include one 
or more centers), second, a variable referring to 
the region of the day care center (a region may 
include one or more divisions) and, third, a vari- 
able describing the size of the day care center 
(measured by the number of calculated places at 
the end of the year). 


Analysis Methods 


The data were examined using various data min- 
ing methods to ensure that the proper productivity 
drivers are identified. For directed data mining 
methods (see e.g. Kudyba & Hoptroff, 2001) the 
obvious output variable is productivity itself. 
Prior to the data mining process, an exploratory
data analysis was performed to gain insight into 
the distribution shapes of the variables as well as 
possible outliers and relationships within the data. 

The data mining tasks were carried out using the
SAS Enterprise Miner tool. The data mining tools used
in this study included decision trees and cluster
analysis (see e.g. Giudici, 2004). Decision trees
gave the authors insight into which productivity
drivers are best related to high productivity.
Additionally, tree models form the structure of the
model from the actual data itself, rather than having
a researcher specify it a priori. Tree models
recursively divide a set of n statistical units
according to a chosen division rule intended to maximize
the homogeneity of the response variable. At each
step of the tree model, the procedure splits the
observations according to the variable that best
explains the target variable (Giudici, 2004, p. 100;
Han & Kamber, 2006; Hand et al., 2001), which
in this case is productivity. Additionally, performing
cluster analysis on the whole data served to
reveal a set of clusters where productivity is high,
along with the drivers that are associated with that
specific cluster. Cluster analysis aims to find a
set of groups where observations within groups
are as homogeneous as possible, i.e. finding which
observations in a data set are similar (Aldenderfer
& Blashfield, 1984; Romesburg, 2004). In this
case, we wanted to find out whether day care 
centers with high productivity have some other 
factors in common. 
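
The idea of grouping similar day care centers can be illustrated with a small clustering sketch. The text does not state which clustering algorithm was used in SAS Enterprise Miner, so plain k-means with a deterministic farthest-point seeding is assumed here purely for illustration; the function name and the sample data (cost per care day, children per employee) are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain k-means on a list of equal-length numeric tuples.

    Seeding is deterministic: the first point, then repeatedly the point
    farthest from its nearest already chosen centroid.
    Returns (labels, centroids).
    """
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(tuple(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# Hypothetical centers: (cost per care day, children per employee).
centers = [(50.0, 6.0), (52.0, 6.2), (60.0, 5.0), (61.0, 5.1), (70.0, 4.0), (72.0, 4.1)]
labels, centroids = kmeans(centers, k=3)
print(labels)
```

In practice the variables would normally be standardized first, since otherwise the variable with the largest numeric range dominates the distance calculation.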

The main motivation for such techniques in 
this study was the relatively easy interpretation of 
both techniques. The tree models are commonly 
depicted as tree-like diagrams with the contents 
of each data subset clearly defined. This makes 
the results more usable for any interested
audience (e.g. the day care center managers) that
may not have a strong background in statistics
or data mining.

In this study, only the direct linkages between 
various drivers and productivity were examined. 
Thus, indirect effects (e.g. absences of children
→ utilization of premises → productivity) were
not included.

Table 2. Summary statistics
[Table: summary statistics by year (2006 and 2007) for the following measures; customer satisfaction is listed for 2006 only]
Cost of calculated care day (€)
Customer satisfaction according to a survey
Percentage of days lost through sickness (%)
Percentage of employees having worked less than 10 years within the day care center (%)
Employee competence according to welfare survey
Degree of utilization of premises (%)
Percentage of days lost through sickness per child (%)
Number of children per number of employees (ratio)
Percentage of S2 children (%)


RESULTS 
Description of the Data 


Table 2 contains descriptive information related 
to the productivity measure and the eight possible 
productivity drivers. The size of the day care 
center varies from 21 places to 147 places in the 
year 2006; the corresponding figures being 21 
and 154 in the year 2007. 

Before analyzing the relationship between 
productivity and various drivers, it is essential to 
understand how productivity differs in the day 
care centers included in the study. In Figure 2 and 
Figure 3 the distributions of the productivity 
measure for 2006 and 2007 are presented. 

Visual examination reveals the existence of
extreme positive outliers in the productivity data
for both 2006 and 2007 samples. Since we used
tree models and clustering, it is recommended to
remove such outliers from the data (Giudici, 2004).
The MAD (Median Absolute Deviation) procedure
was used to filter out any outliers from the data
set (for more information on MAD, please refer
to Pearson, 2005). Since the data set is rather small
compared to many other data mining sets, a moderate
factor of 9 deviations from the median was
chosen. After the outlier filtering procedure, the
distributions for productivity look as presented
in Figure 4 and Figure 5.

Even though there still seem to be some unusu- 
ally high observations in the 2007 productivity 
variable, no more filtering was done to ensure a 
suitable amount of observations for analyses. 
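
The MAD-based filtering step can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the text does not say whether the MAD was rescaled (for example by the constant 1.4826 often used to make it comparable to a standard deviation), so the sketch uses the raw MAD and exposes a hypothetical `scale` parameter. The sample costs are invented.

```python
from statistics import median


def mad_filter(values, factor=9.0, scale=1.0):
    """Keep the values that lie within `factor` (scaled) MADs of the median."""
    med = median(values)
    mad = scale * median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # no spread around the median; nothing to filter against
    return [v for v in values if abs(v - med) <= factor * mad]


# Hypothetical costs per calculated care day with one extreme positive outlier.
costs = [48.0, 50.0, 51.0, 52.0, 53.0, 55.0, 120.0]
kept = mad_filter(costs, factor=9.0)
print(kept)
```

Here the median is 52 and the raw MAD is 2, so anything more than 18 away from the median is dropped; only the value 120 is removed.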


Figure 2. Distribution of productivity in 2006

Figure 3. Distribution of productivity in 2007


Data Mining Results


First, we applied a decision tree model to both 2006
and 2007 datasets with measured productivity
as a target variable. Several tree structures were
tested before the final result with the following
criteria: (a) only binary splits were allowed to
maintain the readability of the tree, as well as
allowing a suitable amount of observations per
leaf; (b) a minimum of 20 observations per leaf
was determined. Figure 6 presents the selected
tree model for the 2006 sample.
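
To make the two tree criteria concrete, the sketch below searches for a single binary split subject to a minimum leaf size. The splitting rule actually applied by SAS Enterprise Miner is not stated in the text; minimizing the within-leaf sum of squared errors of the target is assumed here as one common way to "maximize homogeneity". The function names and the synthetic data are hypothetical.

```python
def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_binary_split(rows, target, features, min_leaf=20):
    """Find the binary split (feature, threshold) that minimizes the total
    within-leaf SSE of `target`, keeping at least `min_leaf` rows per leaf.

    `rows` is a list of dicts. Returns (feature, threshold, sse) or None
    when no admissible split exists.
    """
    best = None
    for feature in features:
        ordered = sorted(rows, key=lambda r: r[feature])
        for i in range(min_leaf, len(ordered) - min_leaf + 1):
            lo, hi = ordered[i - 1][feature], ordered[i][feature]
            if lo == hi:
                continue  # cannot place a threshold between equal values
            left = [r[target] for r in ordered[:i]]
            right = [r[target] for r in ordered[i:]]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (feature, (lo + hi) / 2, total)
    return best


# Synthetic data: 40 centers whose cost drops once the ratio reaches 6.0.
rows = [
    {
        "children_per_employee": 4.0 + 0.1 * i,
        "size": 30 + (i * 7) % 40,
        "cost": 70.0 if i < 20 else 50.0,
    }
    for i in range(40)
]
split = best_binary_split(rows, "cost", ["size", "children_per_employee"])
print(split)
```

A full tree would apply the same search recursively to each resulting leaf until no admissible split remains.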

In 2006, the most differentiating factor is the
number of children per single employee, with a
higher number yielding better productivity.
Additionally, employee competence has an impact
on productivity, but even if the employee
competence is high, the benefits are not fully realized
if the utilization of premises is not high as well.
With less competent personnel, smaller improvements
in productivity are seen in centers with
fewer S2 children. Correspondingly, Figure 7
presents the selected tree model for the 2007 data
set.

Figure 4. Distribution of productivity in 2006 after the MAD procedure

Figure 5. Distribution of productivity in 2007 after the MAD procedure


As in 2006, in 2007 the single most decisive
factor in differentiating the productivity of a
random day care center is the number of children
per employee; the higher the number, the better
the productivity. Furthermore, large day care
centers seem to be more productive than smaller
ones, most probably due to economies of scale.
Even in large day care centers, the higher the
customer satisfaction, the better the productivity.
Hence, these two factors may not conflict as
suggested in the literature. It should be noted that in
small centers less experienced personnel seem
to affect the center’s productivity positively. More
experience may mean more costs since salaries
tend to rise over the career. However, long
experience does not necessarily increase the outputs.

Figure 6. Decision tree for the 2006 sample

Figure 7. Decision tree for the 2007 sample

To sum up, in both 2006 and 2007 samples
the number of children per employee (i.e. the
measure of employee resources) has the greatest
impact on productivity. This makes sense, as most
of a day care center’s costs accrue from
employee salaries. The provision of day
care services is employee-intensive and therefore
it is necessary to pay attention to the productive
use of employee resources. Since this should be
obvious, we omitted the number of children per
employee variable, and constructed the same
decision trees as before. Figure 8 presents the
tree for the 2006 sample and Figure 9 presents
the tree for the 2007 sample.

When the number of children per employees 
variable was removed from the tree, the utilization 
of premises emerged as the single most determining factor affecting productivity in 2006. After that, the number of S2 children becomes important, with a smaller number of S2 children yielding better productivity. Still, even with a smaller number of S2 children, smaller day care centers do not seem to be any more productive than the average day care center in the sample. On the other hand, if the number of S2 children is large, competent personnel may provide some additional productivity. This is an understandable result, since competent personnel are needed to take care of children needing extra support.

Figure 8. Decision tree for the 2006 sample with the number of children per employees variable removed

In the 2007 sample the size of the day care center is the most discriminating factor in determining the centers’ productivity. Similar to the 2006 sample, the number of S2 children has an effect on productivity here as well.

Figure 9. Decision tree for the 2007 sample with the number of children per employees variable removed

This study set out to find the specific factors that affect productivity rather than concentrating on the actual fiscal amounts. To assess the accuracy of the models, SAS Enterprise Miner allows the user to assign a profit function to each of the obtained models (see e.g. Matignon, 2005). The main idea behind the assessment was not to compare the models with each other but rather to a random occurrence. The profit function was defined as a linear function that takes the value 0 for the average target (productivity) value, i.e. the model does not increase the probability of higher than average productivity. For the maximum productivity occurring in each of the samples, the profit function takes the value 1. As can be seen from Figure 10, for the first three deciles, the cumulative profit values for all the models ranged between 0.188 and 0.272 (a random guess would give the value 0). This implies that our models return better than average productivities, and can be used to extract factors affecting productivity.

Figure 10. Cumulative profit functions for all four tree models with zero baselines
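
The profit function described above can be illustrated with a short sketch. It is only one plausible reading of the description, not the SAS Enterprise Miner implementation: the function names, the productivity values and the model scores below are hypothetical, and the cumulative profit is assumed to be the mean profit of the cases ranked highest by the model.

```python
from statistics import mean


def linear_profit(value, average, maximum):
    """Linear profit: 0 at the sample average, 1 at the sample maximum."""
    return (value - average) / (maximum - average)


def cumulative_profit(actual, scores, fraction):
    """Mean profit over the top `fraction` of cases ranked by model score.

    Assumption: this mirrors a cumulative profit chart read at a given
    depth (e.g. 0.3 for the first three deciles).
    """
    avg, top = mean(actual), max(actual)
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return mean(linear_profit(y, avg, top) for _, y in ranked[:k])


# Hypothetical productivity values and model scores, for illustration only.
productivity = [0.8, 1.0, 1.2, 0.9, 1.1, 1.4, 0.7, 1.0, 1.3, 0.6]
model_score = [0.2, 0.5, 0.8, 0.3, 0.6, 0.9, 0.1, 0.4, 0.7, 0.0]

top_three_deciles = cumulative_profit(productivity, model_score, 0.3)
whole_sample = cumulative_profit(productivity, model_score, 1.0)
```

Because the profit is centered on the sample average, the mean profit over the whole sample is zero, which is why a random selection serves as the zero baseline; a positive value in the first deciles indicates that the model picks out centers with above average productivity.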

In summary, the tree models clearly suggest that the following items have a positive influence on a day care center’s productivity: (1) a large number of children per employee; (2) efficient utilization of premises; (3) high employee competence; (4) large size of center; and (5) a small number of S2 children. Measures related to the use of employee resources and the utilization of premises could have been used as alternative measures of productivity, since they are clearly linked to the output/input ratio. Therefore, it is natural that they are strongly linked to productivity. It seems to be more efficient to provide services in large centers. The need for high employee competence makes sense, as does the small number of S2 children: more inputs are needed in providing services for S2 children. However, these inputs do not have an effect on the output of the measure used in this study.

Figure 11. Input means plot for the 2006 sample

To gain more insight into how the day care 
centers in the sample differ from each other, a 
cluster analysis was performed on both samples. 
Because of the small number of observations 
available, and for easier interpretation of the results,
three clusters were used. Cluster analysis does not 
predict the behavior of any variable, nor does it 
classify other variables according to the values 
of others, but rather describes the similarities of 
values within the data set. This is used to find 
similarities between those day care centers with 
high productivity. 


The input means plots for both 2006 and 2007 
samples are presented in Figure 11 and Figure 
12, respectively. The input means plots rank the 
normalized input means for each variable in each 
cluster relative to the overall input means. This 
way we can easily spot any clusters that are as- 
sociated with higher than average productivity. 
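
As a rough illustration of what an input means plot summarizes, the sketch below computes, for each cluster, how far the normalized mean of each variable lies from the overall normalized mean. It assumes that cluster assignments are already available and that each variable is min-max scaled to [0, 1]; the function name, the variable names and the data are hypothetical, and the exact normalization used by the software may differ.

```python
from collections import defaultdict
from statistics import mean


def normalized_input_means(rows, clusters):
    """Return {cluster: {variable: cluster mean - overall mean}}.

    Each variable is min-max scaled to [0, 1] first. Assumes every
    variable takes at least two distinct values (no zero range).
    """
    variables = list(rows[0].keys())
    lo = {v: min(r[v] for r in rows) for v in variables}
    hi = {v: max(r[v] for r in rows) for v in variables}
    scaled = [{v: (r[v] - lo[v]) / (hi[v] - lo[v]) for v in variables}
              for r in rows]
    overall = {v: mean(s[v] for s in scaled) for v in variables}
    grouped = defaultdict(list)
    for s, c in zip(scaled, clusters):
        grouped[c].append(s)
    return {c: {v: mean(m[v] for m in members) - overall[v] for v in variables}
            for c, members in grouped.items()}


# Hypothetical day care centers (cost per care day, number of places).
centers = [
    {"cost_per_day": 60.0, "size": 40.0},
    {"cost_per_day": 62.0, "size": 45.0},
    {"cost_per_day": 50.0, "size": 90.0},
    {"cost_per_day": 52.0, "size": 100.0},
    {"cost_per_day": 70.0, "size": 20.0},
    {"cost_per_day": 66.0, "size": 25.0},
]
assignments = [1, 1, 2, 2, 3, 3]  # hypothetical cluster labels

deviations = normalized_input_means(centers, assignments)
low_cost_clusters = [c for c in deviations if deviations[c]["cost_per_day"] < 0]
```

A cluster whose cost per day lies below the overall mean (a negative deviation) is a candidate for closer examination of its other variables.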

In the 2006 data only one cluster (Cluster 2) has a smaller than average cost per day. By examining Cluster 2 more closely, we can extract the factors associated with that lower cost. The input means plot presents the clusters and the normalized factor means associated with each cluster.

The analysis shows that day care centers in Cluster 2:

• have average personnel experience
• are larger than average
• utilize the premises well
• have a lower than average number of children per employee
• have an approximately average number of S2 children
• have highly competent personnel
• rank very well in the customer satisfaction survey
• have slightly more personnel sick days than average
• have more children’s sick days than average.

It is not surprising that large day care centers seem to be more productive than smaller ones, due to economies of scale. The utilization of premises is at the core of the productivity phenomenon and was assumed to have a positive impact on productivity. A slight surprise is that it is not necessary to have many children per employee in order to achieve high productivity.

Figure 12. Input means plot for the 2007 sample

Cluster analysis in the 2007 sample revealed two clusters with above average productivity (Clusters 1 and 3). Interestingly, in Cluster 1 the size of the day care center is well above average and in Cluster 3 below average. Since the variables in the input means plot are listed in order of importance, and since the tree models suggest that large day care centers are more productive, we classified the clusters as “small productive day care centers” (Cluster 3) and “large productive day care centers” (Cluster 1).

According to the results presented in Figure 12, small productive day care centers (Cluster 3):

• have less competent personnel than average
• have a very high number of children per employee
• have less experienced personnel
• utilize the premises at an average level
• have a high number of personnel sick days
• have a high number of S2 children
• have a very low number of children’s sick days
• score lower than average in customer satisfaction surveys.


In small day care centers, there seems to be a need for a high number of children per employee in order to achieve high productivity. Employees with a high level of competence or long working experience are not necessities. In addition, small productive day care centers appear to require a low level of children’s sickness absence. It is necessary to have all the children present in order to utilize the resources well in striving for the maximum number of outputs.

Instead, large productive day care centers (Cluster 1):

• have highly competent personnel
• have a lower than average number of children per employee
• have personnel with approximately average experience
• have very intensively utilized premises
• have a large number of personnel sick days (not as high as small productive day care centers)
• have a large number of S2 children (not as many as small productive day care centers)
• have a greater than average number of children’s sick days
• score higher than average in customer satisfaction surveys.


Large productive day care centers seem to 
have highly competent personnel with average 
experience. These units are productive even if 
they do not have many children per employee. 
However, the utilization of premises needs to 
be high. A rather surprising result in light of the 
existing literature is that large productive day 
care centers (and small units too) do not need to 
have a small number of personnel sickness days. 


CONCLUSION 


This study provided answers to the three questions posed. First, a literature review and knowledge obtained through workshops were used to arrive at a theoretical framework presenting the assumed drivers of productivity. The framework was used as a basis for the empirical examination. Second, a large data sample was used in the examination of the linkages between assumed drivers and productivity. Both decision trees and cluster analysis were applied. According to the results, the drivers that are most closely related to high productivity are the following: proper use of employee resources, efficient utilization of premises, high employee competence and customers with little need for additional support. In addition, the large size of centers seems to have a positive impact on productivity. Third, this study failed to provide a clear answer to the question: which specific driver combinations are associated with higher productivity? Instead, the cluster analysis revealed that various driver combinations may contribute to high productivity in the day care center. According to the results, it is easier to achieve high productivity in large day care centers. Hence, productivity may be high even if there are not many children per employee. In light of this, it can be argued that productivity could be improved in a more human way by paying attention to the size of day care centers. Interestingly, customer satisfaction seems to be higher in large and productive day care centers. A surprising result of this study is that productivity can be high even if there are many sickness absences among employees.

Prior research has paid a lot of attention to productivity improvement and measurement. However, there is a lack of knowledge about productivity drivers. In addition, most of the productivity research has focused on macro level examination or on the context of traditional (e.g. manufacturing) companies. This study contributes to productivity research by providing new knowledge about the drivers of productivity in the context of public services. The results help to better understand the phenomenon and the various factors related to it. The results of this study rely on a large amount of empirical data. To the best of the authors’ knowledge, this type of data has not previously been used in service productivity research. The information may also be useful in practice. The managers of child day care services can utilize the results when investing in productivity improvement. The development work can be targeted at the right factors, i.e. the drivers most closely related to high productivity.

It should also be noted that not all the factors affecting productivity (e.g. center size and customer structure) can be controlled by day care center managers; some are controlled by higher level managers and politicians. The impacts of improvements may be substantial. Due to the large number of service providing units, small improvements in the right factors may yield substantial savings of public money. If the cost of a care day could be reduced by one or two euros in every municipal day care center of the City of Helsinki, yearly savings of several million euros could be achieved. Furthermore, this study demonstrates that data mining could well be a practical and useful method for analyzing and managing productivity in various public organizations. However, this requires that an organization already uses measures and surveys rather extensively. In addition, there has to be a high number of organizational units that use a similar set of measures in order for the sample size to be large enough.

There are some issues that should be considered when interpreting the results of this study. First, some criticism can be made regarding the measures used. Measures were not available for all the potential drivers of productivity. In addition, the validity of the measures is not optimal. Validity refers to the extent to which a test measures what a researcher wishes to measure (e.g. Emory, 1985, p. 94). We chose to use measures that were available, not the most valid ones. For example, the measure for productivity does not include all output aspects. In particular, factors related to the quality of service are not very well captured by the measure. On the other hand, the practicality (cost vs. benefits) of the measures is high. It is not untypical that indirect measures have poor validity but can provide some information regarding an important phenomenon that otherwise could not be described at all (Lönnqvist 2004, 96). Second, the empirical data does not provide the most typical data mining research setting, because the size of the sample was not very large. On the other hand, the sample actually includes the whole population (typical day care centers in the Helsinki region). Contextual issues depending on the region should not affect the results of this study. The main purpose was not to generalize the results, but to understand the phenomenon in this specific context. However, the results may also be useful for other public service organizations (e.g. elderly care). There are similarities between different public services, since they all need many personnel with close contact to clients. In light of the findings of this study, it seems that in improving the productivity of these kinds of services it is necessary to focus on utilizing service providing capacity, managing human resources and anticipating the needs of different customers. It is also necessary to pay attention to the size of service providing units. It should also be noted that high customer satisfaction (an indicator of service quality) is not necessarily in conflict with high productivity.
This study represents a small attempt to understand productivity and various productivity drivers in the context of public service organizations. More research is needed. Many issues were excluded from the scope of this study. First, the study explored the direct linkages between different productivity drivers and productivity. However, the indirect relationships were not taken into account. To be able to understand the phenomenon more profoundly, these complicated relationships need to be analyzed. Second, this study did not take into account the possible time lag between drivers and productivity. Thus, an important topic for further analysis is the examination of a longer period of time. For example, adding the results for 2008 would allow an examination of a three-year period. Third, this study was based on quantitative data and statistical methods. More detailed information on the productive day care center is valuable. Hence, a fruitful avenue for further research is in-depth case studies (e.g. interviews) in high productivity organizations. Fourth, this study only used data on one region (a large municipal organization). It would be interesting to carry out the same analysis in the context of other similar organizations. The characteristics of different organizations and regions may have an effect on the role of different productivity drivers.

KEY TERMS AND DEFINITIONS


Driver: Productivity drivers relate to different 
factors having an effect on the level of productivity. 

Measure: Measures are used to provide rel- 
evant, quantitative information for managerial 
purposes. 

Productivity: Productivity refers to the ratio 
between output and input used to produce the 
output. 

Productivity Management: Productivity 
management refers to various managerial activi- 
ties which aim to improve productivity. 

Public Service: Public services include various 
functions like social care, health care and education 
that are organized by central and local government 
and funded with tax revenues. 


Section 2
Data Mining as Privacy, Security and Retention of Data and Knowledge

ABSTRACT


Mobile computing is a maturing technology with benefits for consumers. The purpose of this chapter is to furnish research on the perceptions of non-information systems students in both America and Europe on the impact of mobile computing devices on privacy and security. The chapter expands upon earlier research that covered only the perceptions of information systems students in America on mobile computing privacy and security. This research indicates a higher level of knowledge of the features of mobile computing, but lower levels of knowledge of inherent issues of mobile computing and consumer privacy and of precaution with mobile computing devices. Findings imply an inadequacy in general curriculum, and especially in data mining curriculum, but also an opportunity to improve the curricula. This research will benefit educators attempting to improve their pedagogy, with syllabi summarized in the chapter that integrate contemporary issues of privacy and security with mobile computing technology.


DOI: 10.4018/978-1-60566-906-9.ch006

INTRODUCTION

The authors of the chapter describe the benefits of mobile computing devices for both American and European consumers in the context of location-based services. They discuss the benefits of location-based services as constrained by the challenge of concerns of privacy and security with the devices. The data mining of information on consumers by business firms and by governments, as consumers interact and transact with location-based services on organizational and governmental applications on the devices, is a concern cited by the authors in this chapter, consistent with the theme of the Handbook.

The focus of the chapter is on the findings of 
the authors on perceptions of privacy and security 
of location-based services. The findings are from 
a survey of European and American students who 
were proxy for consumers of mobile computing. 
From the findings, the authors furnish a founda- 
tion for integrating location-based privacy and 
security into data mining and general curricula of 
schools for undergraduate and graduate students 
who are the current and future consumers of 
mobile computing, so that privacy and security 
might be perceived as critical facets of pervasive 
computing in society, a perception that might not 
be evident in the curricula pedagogy of schools. 

The objectives of the chapter are to discuss the 
benefits and concerns of location-based services 
with mobile computing devices, the perceptions 
of privacy and security of the devices by proxy 
students, and the proposed solutions and trends 
with mobile computing devices and services that 
might be integrated into curricula of schools. The 
research in this chapter is important to the field, 
because curricula of schools might not be current 
with organizational and governmental practices of 
data mining that impact, if not intrude on, privacy
and security of mobile computing technology. The 
research helps educators by informing them of the 
perceptions of non-information systems students 
who might not be as knowledgeable of privacy and 
security threats as information systems students. 

The Appendix following the chapter will be 
especially helpful to instructors considering syl- 
labi of privacy regulation and security of mobile 
computing technology. 


BACKGROUND 


Mobile computing applications on mobile computing devices (MCDs), such as cellular phones, laptops, personal digital assistants (PDAs), tablets, and other devices, are advancing in beneficial features for consumers. Browsing information and news, game playing, instant messaging, personal and professional e-mailing, and photo and text messaging are frequent features on the devices (M:Metrics Inc., 2006). These devices have advanced
from basic cellular phones and PDAs to light 
computing devices interfaced to the Internet with 
information-rich and location-based or enabled 
services. Innovations in mobile computing have 
advanced from cellular payment systems to high 
speed networks in Europe, which is considered 
further along in the development of the devices 
than in America (Lundquist, March, 2007). Mo- 
bile computing with location-enabled services 
is considered by pundits as the killer application 
(Lundquist, April, 2007) and the technical trend of 
2007 integral to consumers (Castells et al., 2007). 
Miniature mobile computing is contributing to a 
new period of pervasive computing (Denne, 2007). 

Data mining involves searching and finding 
hidden patterns in large databases of mostly public 
data to generate profiles based on personal data 
and behavior patterns of citizens and consumers 
(Tavani, 2004). Data mining analysis methods 
evaluate the potential of current customer profiles 
in order to facilitate future customer prospecting 
and sales. Much of the data that is mined today 
is either public or semi-public — our supermarket 
purchases, surfing habits, salary, location, and 
other such information. The main ethical issues in 
data mining are that consumers are not generally 
aware that their data is being gathered, do not know
the uses to which the data will be put, or have
not consented to the use of such data.

Presently, in the United States, there are limited legal restrictions on the use of personal data for data mining. Other than the protection of healthcare data under the 1996 Health Insurance Portability and Accountability Act (HIPAA), financial data under the 1999 Gramm-Leach-Bliley Act, or the protection of children while on-line under the Children’s Online Privacy Protection Act (COPPA), there are no federal laws that protect data mined by firms and organizations. On the other hand, the European Union’s Directive 95/46/EC strictly controls the gathering and use of private data by firms and organizations operating within the European Union (European Directive 95/46/EC, 1995).

Location-based privacy is the freedom to 
limit control of one’s whereabouts. This concept 
is derived from the notions of accessibility pri- 
vacy (knowing one’s location can be construed 
as a type of intrusion) and informational privacy 
(one’s location can be construed as private infor- 
mation). Location data gathered from cell phone 
and other GPS enabled devices are not clearly 
protected by either United States federal law or 
European Union legislation (Molluzzo & Lawler, 
2007). Where one is located and when are not the
only things that can be inferred from Global
Positioning System (GPS) data. Studies by Eagle
and Pentland (2006) and others have indicated
that diverse social patterns for individuals and
groups, such as when one leaves home for work,
when one leaves work for home, when one plans
a circle of friends, and so on, can be inferred from
cell-phone data. This process has become known
as reality mining.


CONCERNS AND ISSUES 


The benefits of location-based services are coupled, however, with concerns about control of personal and private information on mobile devices and with perceptions of frequent incidents on the devices of likely identity theft and intrusion on the privacy of consumers (Grossman, 2007). Privacy activists, such as the Electronic Privacy Information Center and the European Commission, cite fundamental issues in the mismanagement, marketing and mining of information on consumers. They cite issues in the monitoring of consumers by business and carrier firms and by governments, using information retained from interactions or transactions (Eggen, 2006). The monitoring is increased in the field of collective intelligence, which integrates information from mobile computing devices and from the Internet. If collective intelligence is misused by governments or by business firms, this level of monitoring might further decrease the expectation of privacy, creating a Big Brother environment more Orwellian than that of data mining alone.

Issues may include employee monitoring 
(Hamblen, 2007). RFID is not infrequently con- 
sidered by pundits and researchers as synonymous 
with surveillance (Curtin et al. 2007). Further 
issues include networks and systems behind 
the services that might be hacked by intruders, 
phishers, spammers and stalkers (Brandt, 2006) 
but not disclosed by firms when they learn of the 
hacking. Firms might lose mobile devices having 
information on customers because of internal loss 
or theft (Pratt, 2007). Firms might lose customers 
because of this (Romano & Fjermestad, 2007). 
Clearly the benefits of location-enabled services 
can be considered paltry when contrasted with 
issues on privacy and security (Stross, 2006). 

The impact of the concerns on location-based 
services may eventually hinder the deployment 
of mobile computing in the marketplace and in 
society. Concerns of access of information or of 
location beyond the carrier, firm or government 
and beyond known collaborators in the absence 
of the knowledge of consumers are considerations 
in the design of location-enabled services on 
governmental and organizational applications. 
Consumers continue to have concerns about in- 
formation interacted on the Internet (Sraeel, April, 
2007). Consumers may not have confidence in 
the privacy and security of location services on 
their mobile computing devices or in regulation 
already considered by legislators to not include 
MCDs (Hines, 2007). The lack of confidence 
may impede pervasive computing as a trend if 
improved control of information and of privacy 
is not implemented in the field by information 
systems practitioners.

RESEARCH AND SOLUTIONS


This chapter introduces a framework for prac- 
titioners and instructors in integrating issues of 
location-related privacy with mobile computing, 
so that pervasive computing continues in society 
to be a bona fide trend. 

In 2007 research (Molluzzo & Lawler 2007), 
the authors analyzed the knowledge of undergradu- 
ate and graduate information system students of 
the impact of mobile computing on location-based 
privacy. In this chapter, the research is extended 
to non-information systems undergraduate stu- 
dents, in the United States and in Europe. Both 
non-information systems and information systems 
populations are Generation Y, Millennial or Net 
Generation students under the age of 28, but are 
effectively knowledgeable consumers of mobile 
computing devices to be proxy in exploratory re- 
search and in initial study of consumers in society. 
The extension of the research to non-information systems students is important to the field, inasmuch as information systems students may be more experienced on factors of privacy and security because of their potentially greater knowledge of the technology. Such experience may be different, if not less, with non-information systems students, inevitably limiting the knowledge of these students in privacy and security and in technology, which might argue that pedagogy be improved in schools of universities educating non-information systems students on contemporary issues of technology.

The survey was administered on-line to stu- 
dents at Pace University in the United States in 
November, 2007. These students were mostly 
first and second year undergraduates taking an 
introductory computing course that is required 
of all students at the university. In the spring of 
2007, the survey was administered to the European 
students, all of whom were undergraduate business 
majors at the University of Mons, Belgium, using 
hard-copy. There were 75 completed United States 
surveys and 19 completed European surveys. The 
survey instructions asked the respondents to limit their responses to their experience using mobile computing devices (MCDs), excluding dedicated audio devices, such as the iPod. The survey was administered anonymously; the respondents’ names were not collected by the authors.

The survey was divided into several sections: 


• Background Questions to gather demographic data;
• Objective Questions on the importance of using mobile devices for various purposes;
• Knowledge Questions on respondent awareness of the privacy issues of location-based data collection;
• Concern and Control Questions about the protection of consumer privacy by government and wireless providers; and
• Summary Question to gauge the respondents’ overall concern for privacy.


These questions were based on, and were essentially equivalent to, those in the 2007 survey (Molluzzo & Lawler, 2007).


BACKGROUND QUESTIONS 


Of the European students, 13 were female and 6 
were male; of the United States students, 49 were 
female (65%) and 26 (35%) were male students, 
which is close to the general Pace University 
population of approximately 60% female and 40%
male students. The academic majors of the United
States students were varied: business (48%), liberal 
arts (13%), computing (8%), nursing (4%), and 
other (27%), which again reflects the general Pace 
University undergraduate population of students. 
The European students were all business majors. 
The average age of the United States students was 
19, while the average age of European students 
was 22. European students reported using their 
mobile devices for an average of 6 hours per day. 
United States students reported using their mobile 
devices on average for 7 hours per day.

Table 1. Frequency of use (percents)

For the analysis below, the European and United States groups of students were asked the following questions:


Objective Questions 


The survey attempted to discern how students use their MCDs. Therefore, objective questions were asked regarding the respondents’ use of MCDs.

Frequency of Use: Respondents were asked to “rate how frequently you use your Mobile Computing Device” and for what reasons. The answers were based on a five-point Likert scale. Each entry in Table 1 is the percent of those answering the question either “Frequently” or “Very Frequently.” The most frequently used service is Social Contacts, with Business/School not far behind it. The least used service was Games.
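
The table entries described above are percentages of respondents choosing one of the top two points of the scale. A minimal sketch of that calculation follows; the function name, the handling of unanswered items (skipped) and the sample responses are assumptions made for illustration and are not taken from the survey data.

```python
def percent_top_two(responses, top=("Frequently", "Very Frequently")):
    """Percent of answered items that fall in the top two scale points.

    Unanswered items (None) are skipped; this is an assumption, since
    the chapter does not say how missing answers were treated.
    """
    answered = [r for r in responses if r is not None]
    hits = sum(1 for r in answered if r in top)
    return round(100 * hits / len(answered))


# Hypothetical responses to one "frequency of use" item.
sample = ["Very Frequently", "Never", "Frequently", None,
          "Sometimes", "Frequently"]

share = percent_top_two(sample)
```

The same function applies to the agreement items by passing `top=("Agree", "Strongly Agree")`.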

Use of Location-Enabled features of MCD: 
Respondents were asked to “rate the frequency 
with which you use the location-enabled features 
of your mobile computing device.” The answers 
were based on a five-point Likert scale. Table 2 
indicates the percent of respondents answering 
“Frequently” or “Very Frequently”. The least used 
features were driving directions and e-banking, 
possibly because of the low average age of the 
students. Also, possibly because of the low aver- 
age age of the students, texting was the most 
frequently used feature, with e-mail and instant 
messaging following.

Table 2. Use of MCD Features (percents)

Private Information: Respondents were
asked “What private information do you store on
your mobile computing device(s)?” A list of pos- 
sible data was presented. Although some per- 
sonal information is inevitably stored on a MCD, 
it is interesting to note that some respondents save 
highly confidential data on their MCD. For ex- 
ample, three people (out of the 75 United States 
students) store their social security numbers, five 
people store unencrypted account passwords, three 
people store credit card numbers, and five store 
bank account numbers. 


Knowledge Questions 


Privacy: The respondents were asked several 
questions that rate their knowledge about vari- 
ous privacy concerns using MCDs. The answers 
were based on a five-point Likert scale. The only 
statement with which a majority of respondents 
“Agree” or “Strongly Agree” (55%) is that a 
“location-based mobile computing device can 
monitor your exact location.” Even this number 
is very low considering that GPS technology is 
constantly in the news. Table 3 indicates the results. 
It is very interesting that such large percentages 
of students do not consider wireless Internet ac- 
cess (53%) and GPS systems (64%) as possible 
threats to their privacy.

Table 3. Privacy (percents)

Statements (Agree or Strongly Agree):
• MCD Location Data Can Be Marketed to Other Firms
• Wireless Internet Access Can Intrude on Privacy
• Email Can Intrude on Privacy

Table 4 questions (percents answering No):
• Do you know what your provider will do if your information is compromised?
• Have you expressly opted out on your mobile contract?
• Do you know the procedure your provider uses to safeguard your personal information?
• Do you read the privacy policies before signing the contract?

Wireless Provider Policies: The respondents 
were asked several questions on their relationship 
with their wireless provider. These were Yes - No 
questions. The respondents’ answers here are in 
line with the Privacy questions in Table 3, which 
indicate a low degree of awareness of important 
privacy issues when using MCDs. This set of 
questions indicates a high degree of complacency 
and lack of knowledge among the respondents 
regarding the actual privacy policies of their 
mobile carriers. For example, 100% do not read 
their carrier’s privacy policy; and 74% do not 
know what their provider will do if their informa- 
tion is compromised. The results are indicated in 
Table 4, where numbers represent percents who 
answered No to the questions. 


Table 5. Trust and advertising (percents)


Statement                                              Agree or Strongly Agree

I like the idea of mobile advertising messages.                    7
I like the idea of mobile advertising if the
advertising is meaningfully personalized to me.                   15
I am concerned about location-based privacy
when using my MCD.                                                25
I am comfortable that my provider will protect
my privacy.                                                       30
I am confident that government regulations will
protect my privacy.                                               34
I am concerned about identity theft.                              56


Concern and Control Questions 


Trust and Advertising: The respondents were
asked to “rate their level of agreement with the
given statement.” The answers were based on a
five-point Likert scale. See Table 5. Respondents
show mistrust that their provider (30% Agree or
Strongly Agree) or the government (34% Agree
or Strongly Agree) will protect their privacy.
A fairly low percentage, 56%, either Agree or
Strongly Agree that they are concerned about
identity theft. Interestingly, however, only 25%
Agree or Strongly Agree that they are concerned
about location-based privacy. It seems the
respondents do not yet consider location-based
data to be personal information.

Mobile advertising did not get a strong vote 
of confidence from the respondents. Only 7% of 
the respondents either Agree or Strongly Agree 
that they would like mobile advertising messages. 
Even if the advertising were targeted and person- 
alized, only 15% of the respondents Agree or 
Strongly Agree. Both results speak to a lack of 
trust among students of Internet and mobile-based 
advertising. 

Protecting Your Mobile Device: The respondents
were asked how they protect their mobile
device. They were provided with a list and were
asked to check all that apply. Not many respondents
encrypt data on their MCD. Only 12% encrypt all
data, 10% encrypt all business-related data, and
11% encrypt all sensitive data. Also, only 17% use
encryption when connected to a wireless network.
However, 44% lock access to their MCD using a
strong password, and 32% set the MCD to auto-lock
when not in use for a specified time. Many
respondents, 62%, keep their MCD hidden when
traveling, but only 29% do not access private or
business data in public places. Finally, only 30%
of respondents remove all data on their MCD
before discarding or turning it in.

Summary Question: Respondents were asked 
“which of the following statements best describes 
your feelings about privacy?”: I feel strongly 
about privacy (41%); I feel strongly about privacy 
but may benefit from surrendering my privacy 
at times if my privacy is not abused by a firm or 
service (54%); and I do not feel strongly about 
privacy (5%). 

These statements are based on the categorization of
subjects into privacy fundamentalists (first statement),
privacy pragmatists (second statement), and
privacy unconcerned (third statement) from extant
field research (Westin, 1996). In a Harris poll
conducted in 2003 (Taylor, 2003), the percentages
of respondents in the three categories were 26%,
64% and 10%. The distribution from our results
was 41%, 54% and 5%. This indicates that our
population of students might be more informed
about privacy issues than the general population,
or that in the intervening four years the general
population has been made more aware of privacy issues.

It is to be noted that the only students who 
do not feel strongly about privacy were from 
Europe. Only two European students out of 19 
students (10.5%) consider themselves privacy 
fundamentalists as compared to 41% of students 
in the United States. This could be a result of the 
fairly strict privacy laws of the European Union, 
as defined in Directive 95/46/EC (European Di- 
rective 95/46/EC, 1995). 

This research also attempted to determine
whether there are significant differences between
the 77 information systems students from the
research in 2007 (Molluzzo & Lawler, 2007)
and the non-information systems students from
the present research. A chi-squared test for
independence was judged suitable for this purpose.
Many of the questions in the survey used a 5-point
Likert scale. However, the small sample sizes made
the chi-squared test on the five Likert categories
unusable. Therefore, the 5-point scale was
compressed in the study. The three lowest Likert
scores (Strongly Disagree, Disagree, and Neutral)
were converted to 0 and the two highest (Agree and
Strongly Agree) were converted to 1. This enabled
use of the chi-squared test for independence on
2x2 tables, which did yield valid results. All
significance measures are two-sided p-values.
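
The compression of the Likert scale and the 2x2 test described above can be sketched as follows. This is a minimal illustration, not the authors' actual analysis: the function names and the two response lists are hypothetical, the statistic is the plain Pearson chi-squared without a continuity correction (the chapter does not say whether one was applied), and the p-value uses the identity that, for one degree of freedom, the chi-squared upper tail equals erfc(sqrt(x/2)).

```python
import math


def collapse_likert(score):
    """Map a 1-5 Likert score to 0 (Strongly Disagree, Disagree, Neutral)
    or 1 (Agree, Strongly Agree)."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("Likert score must be an integer from 1 to 5")
    return 1 if score >= 4 else 0


def two_by_two(group_a, group_b):
    """Build [[a_agree, a_other], [b_agree, b_other]] from raw Likert scores."""
    rows = []
    for scores in (group_a, group_b):
        agree = sum(collapse_likert(s) for s in scores)
        rows.append([agree, len(scores) - agree])
    return rows


def chi_squared_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) and
    two-sided p-value for a 2x2 table, df = 1."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (
        row_totals[0] * row_totals[1] * col_totals[0] * col_totals[1]
    )
    # With one degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical responses, invented for illustration only (not survey data).
is_students = [5, 4, 4, 5, 3, 4, 5, 4, 2, 4]
non_is_students = [2, 3, 1, 4, 2, 3, 3, 5, 2, 1]

table = two_by_two(is_students, non_is_students)
chi2, p = chi_squared_2x2(table)
print(table, round(chi2, 3), round(p, 4))
```

Collapsing to two categories raises the expected count in each cell, which is why the 2x2 form remains usable with small groups where the full five-category table does not.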


Information Systems vs. Non- 
Information Systems Students 


MCD Use: There were statistically significant 
differences between information systems and 
non-information systems students at the p <.001 
level of significance in three of the uses of their 
MCD devices. These uses were for emergency, 
storing digital media, and e-banking, as indicated 
in Table 6. 

Non-information systems students tend to use 
the social functions of their MCDs more than 
computing students. For instant messaging, text 
messaging, contacting family and friends there 
were significant differences in use at the p <.001 
level. For social contact uses there was a signifi- 
cant difference at the p <.01 level. 

There were also some significant differences
in what students store on their MCDs. There is a
difference at the p <.001 level of significance in
storing their age and storing their school name.
There is a difference at the p <.01 level in storing
their place of employment.

In all the above cases, except for use in emergencies,
non-information systems students tend
to use their MCDs for the reasons indicated more
than information systems students. These uses,
such as storing digital media, using MCDs for
social contacts and storing personally identifiable
information, such as age and place of employment,
indicate that non-information systems students are
more apt to use their MCDs for social reasons.

Table 6. Significant differences in uses between
computing and non-computing students


Emergency
Storing Digital Media
Instant messaging
Texting
Contacting Family and Friends
Social Contacts
Storing User’s Age
Storing User’s School Name
Storing User’s Place of Employment

Privacy Awareness: There was a significant
difference between information systems and
non-information systems students in the awareness
of how MCDs can affect privacy, as indicated in
Table 7. Information systems students were
significantly more aware (at the p <.001 level) that
wireless access and GPS services can intrude on
privacy. Non-information systems students were
also significantly more concerned about identity
theft (at the p <.001 level). There was also a
difference at the p <.05 level in the knowledge that
an MCD can monitor your exact location.

Mobile Advertising: The only question related
to wireless provider in which there was a
significant difference between information systems
and non-information systems students at the p
<.05 level was whether they like the idea of
mobile advertising, as indicated in Table 7.
Non-information systems students seem to like the idea
more.

Table 7. Significant differences in privacy awareness
between computing and non-computing
students


Wireless Access Can Intrude on Privacy
MCDs Can Monitor My Exact Location
I Like the Idea of Mobile Advertising


Table 8. Significant differences in control questions
between computing and non-computing students


Encrypt All Data Stored on MCD
Use Encryption When Connecting to a Wireless Network
Remove All Data Stored on MCD Before Discarding or Handing In

Control Questions: There were also significant 
differences in the ways the two groups protect 
their MCDs, as indicated in Table 8. In the use 
of encryption, either in storing sensitive data or 
connecting to a wireless network, there were 
differences at the p <.001 level of significance. 
Similarly, there was a difference at the p <.01 level 
in the encryption by the groups of business data 
on their MCDs. Also there was a difference at the 
p <.05 level of significance in encrypting all the 
data on their MCDs. In all the above cases, information
systems students were more likely to use
encryption. Finally, there was a difference at the p
<.05 level of significance in whether the students
remove all data on their MCD before discarding 
the device or returning it to the provider. Once 
again, information systems students were more 
likely to remove information from their MCDs. 

The differences in results between information
systems and non-information systems students
show that information systems students are more
aware of location-based privacy issues than is the
general student population. Clearly the general
population needs to be made aware of the privacy
and security issues surrounding the use of
MCDs. The importance of the present research is
that the results indicate that universities need to
improve how they educate non-information systems
students on the privacy and security of technology.


United States vs. European Students 


This research further investigated the differences
between United States and European students in
their use of MCDs. Because of the small sample
size of European students (n = 19), only results
that were significant at the p <.01 or p <.001 levels,
with one exception noted below, are indicated in
Table 9. Even in these cases, generalizations from
the results should be made only with caution.

On the use of MCDs for accessing weather,
business/school, and family and friends, only one
European student in all three categories answered
yes (p <.001). The interpretation of this result is
not clear. Further investigation with a larger
sample of students is necessary. Several
other MCD use differences were observed at the 
p <.01 level of significance, such as for emergen- 
cies, storing and sharing digital media, and search- 
ing. European student use is far less than that of 
United States students. 

Only one other difference at the p <.05 level
of significance was observed in the results. That
is, European students are not as concerned about
identity theft as are their United States counterparts.
This difference could be due to the European
Union’s stronger privacy laws (European Directive
95/46/EC, 1995).

Table 9. Significant differences between U.S. and
European students


Contacting Business or School
Contacting Family and Friends
Storing and Sharing Digital Media
Concern Question: I Am Concerned About Identity Theft

FUTURE TRENDS 


A trend emerging from the research in this chapter
is the clear knowledge of information systems
students of the fundamental functionality of
mobile computing devices compared to a lesser
knowledge of these issues by non-information
systems majors. The non-information systems
students were not as knowledgeable in both the
processes and practices of mobile computing firms.
These trends imply a likely lower sensitivity of
non-information systems students to the larger
impact of mobile computing technology on society
as compared to information systems students.

Another trend from the research is an inconsistency
in the higher knowledge of both groups
of students in the processes of location-based
mobile computing technology in contrast to lower
personal precaution with the technology. The
students were not as diligent as expected in the
confidentiality and protection of information on
mobile computing devices, which is not distinct
from the inconsistency of non-student subjects in
follow-up of intrusions of privacy (Sraeel, May,
2007). Though they felt the generic importance
of privacy, the students were not fully protective
of their devices through recognized security
techniques. This lower diligence in precaution was
not an encouraging example for the management
of the privacy and security of mobile computing
technology. The trends imply a lower sensitivity
to the non-technological impact of mobile computing
as a societal tool, a theme that continues
in the research.

Further trends from the research include the 
potential opportunity to improve the mobile com- 
puting syllabi of information systems and business 
or non-information systems instructors, in order to
mitigate deficiencies in knowledge. The students 
may learn more of the impact of marketing, mining 
and business practices that mobile computing firms 
and retailers might apply from innovations in mo- 
bile computing technologies, if schools improved 
their information systems syllabi. Information 
systems students may also learn more of privacy 
and security issues and techniques with mobile 
technology (Taylor, 2007). Moreover, they may 
be encouraged as future practitioners and profes- 
sionals by their instructors to be more sensitive 
to regulatory and societal themes. These trends 
imply minimally that an improvement is needed 
in mobile computing and information systems 
and business or non-information systems syllabi, 
which is furnished in the Appendix of this chapter. 

Other trends to be elaborated from the research
in the chapter include the clear knowledge of the
fundamental functionality of mobile computing
devices. The information systems students were
knowledgeable in the processes of mobile computing
firms. This knowledge was, however, indicated
to be not as clear and to be lower in the probable
privacy and security practices of the firms. The
general student population seems to be much
less aware of both the processes and practices
of mobile computing firms. These trends imply
a likely lower sensitivity to the larger impact of
mobile computing technology on society.

A final trend of the research is an inconsistency 
in the higher knowledge of the information sys- 
tems students in the processes of location-based 
mobile computing technology in contrast to lower 
personal precaution with the technology. These 
students were not as diligent as expected in the 
confidentiality and protection of information on 
mobile computing devices, which is not distinct 
from the inconsistency of non-student subjects in 
follow-up of intrusions of privacy (Sraeel, May, 
2007). Though they felt the generic importance 
of privacy, the students were not fully protective 
of their devices through recognized security tech- 
niques. This lower diligence in precaution was 
not an encouraging example for the management 
of the privacy and security of mobile computing 
technology. The trends imply a lower sensitivity 
to the non-technological impact of mobile com- 
puting as a societal tool, a theme that continues 
in this research. 


Limitations and Opportunities 


The main limitations to the study described in 
this chapter are the scope and size of the study 
sample. To make more general conclusions the 
sample would need to be widened in scope to 
include the general population, not only college 
students, and would need to reflect the composi- 
tion of the general population of MCD users in 
the United States and Europe. Also, to go beyond 
the preliminary study of this chapter, the number 
of subjects involved in the study would have to 
be increased to the point where reliable statistical 
analysis can be applied. 

Each of the limitations described provides an 
opportunity for further research. A broad survey 
of the general population of the United States 
and Europe having a large enough sample would 
produce interesting results in that population’s 
knowledge, uses and concerns about privacy
and security in the use of their mobile computing 
devices. Moreover, similar studies done in other 
areas of the world would be of great interest. 

Finally, there are other opportunities for further
research. It was noted, in Concern and
Control Questions of the Research and Solutions
section of the chapter, that the survey seems to
indicate students do not consider location-based
data to be personal information. This attitude might
soon change with the advent of new services,
such as Google Latitude and Loopt, which allow
monitoring of the physical locations of others. In the
same section, it was also noted that there is a lack
of trust in mobile advertising. Future study might
investigate the basis for the mistrust. Lastly, in the
results of the research, in Protecting Your Mobile
Device, it was noted that most of the students do
not adequately protect their MCDs or remove
sensitive information from those devices when
they discard them or return them
to the providers. Why is the general population
of students careless about the protection and the
security of the information stored on MCDs? In
what manner might the public be informed of the
issues of non-protection and non-security? The
field is ripe for new research.


CONCLUSION 


The research in this chapter analyzed the learn- 
ing and non-learning of non-information systems 
students on location-based privacy with mobile 
computing and compared these findings to an 
earlier study of information systems students. 
Also indicated in the research is a lower level of 
knowledge of non-information systems students 
of inherent data mining methods and practices that 
might intrude on privacy and security. Essentially 
non-information systems students are indicated to 
be less knowledgeable of organizational practices 
of privacy and security on mobile computing 
technology.

Overall, the importance of the results and the
trends are in the necessity of improving non- 
information systems curricula, especially data 
mining curriculum, in universities integrating 
organizational and governmental practices of 
privacy regulation and security into societal- 
sensitive syllabi. The results of this chapter also 
indicate a number of challenges for personal 
privacy, security, how commercial Web sites and 
organizations handle the privacy of personal data, 
and how software is developed that incorporates 
privacy into the design phase of its development. 

Data Security and Privacy Challenges: There
are now location-aware applications that can let a
person’s circle of friends know her location. The
social networking application Loopt (www.loopt.com)
uses the GPS chip in a person’s MCD to place
her location on a map. She can then allow friends
to see where she is and they can allow her to see
where they are located. While this application and
others like it, for example Google Latitude
(www.google.com/latitude), allow a person to meet up
with friends, there are a number of privacy concerns
with their use. For example, after turning on
the application, the user might forget to turn it off,
thus giving away her location when she does not
want it known. Also, it is possible for someone
else to turn on the application and monitor her
location without her knowing it. Even if she does
not allow anyone to use her phone, it might be
possible, while she is surfing, for an application to load
onto her device, detect the presence of Loopt and
turn it on without her knowledge. Although this
last scenario is hypothetical, it is not outside the
realm of possibility. For example, if a Web site
uses Adobe Flash to place animations on a Web
page, Flash can actually detect the computer’s
microphone and Webcam and make adjustments
to them (Adobe, 2009).

In addition to its real-time availability, one’s
location is stored in the application vendor’s
database. Both Loopt and Google Latitude say
that they do not store historical locations, only
a user’s last location. While this would prevent
historical location data from being used by the
application vendor and from being subpoenaed
by the government or in a civil or criminal
investigation, it is only company policy. Company policy
can change, especially if user location data can
be seen as a source of advertising revenue. In this
situation, a vendor might decide to anonymize the
location data and then use it or sell it to marketing
companies so it can be mined for commercial
purposes. However, recent research has shown that
even anonymized data, once mined, can be linked
to individuals (Bonneau, 2009).
This shows that anonymity is not equivalent to
privacy. Also, we cannot count on federal laws
to help protect our location data. Paul Stephens,
director of policy and advocacy at the Privacy
Rights Clearinghouse, recently remarked that “These
location-tracking services have been available
since about 2005 and the laws haven’t caught up
with the technology” (Farrell, 2009).

Research conducted at the MIT Media Lab 
has developed techniques that combine conver- 
sation, location, and temporal data to do social 
network analysis. The original experiment (Eagle 
and Pentland, 2006) involved collecting a huge 
amount of data from 100 mobile phones over 
nine months. The phones acted as wearable sen- 
sors giving the location and other data of their 
users. The analytic technique employed, called 
reality mining (http://reality.media.mit.edu/), can 
be used to predict the behavior of organizations, 
groups of individuals, and even to predict what 
a single user will do next. Many organizations, 
including telecommunications companies, have 
the computing power, data storage, data mining 
capacity, and expertise to duplicate the MIT ex- 
periment using their customer data. Therefore, in 
the not-too-distant future, it might be possible for 
such organizations to reality mine their customer 
data to make accurate predictions of individual 
behavior. This is certainly a point of concern that 
the public and government should be made aware 
of prior to its possible implementation. 


Another area of concern is the spread of “spear 
phishing” attacks. A recent incident will illustrate 
the problem. In the fall of 2008, about 10,000 
members of LinkedIn, a social networking site 
for professionals, were targeted by customized 
emails sent to specific individuals. The emails, 
which were made to look like they came from sup- 
port@linkedin.com, asked the recipients to open 
a malicious attachment that supposedly contained
the names of business contacts (Krebs, 2008). 
Such attacks are made possible by scammers 
mining the data contained in social networking 
sites. Once such data is linked with location data, 
spear phishing attacks will certainly arrive on cell 
phones and other mobile devices. 

The research of this chapter also shows that
the general public needs to be made more aware
of privacy threats in general and in particular
threats associated with mobile computing. As
noted previously, most of the survey respondents
in the chapter study were not aware of how
intrusive on privacy their mobile devices can be.
Apparently, most people are also either unaware of
or complacent about data breaches. The Identity
Theft Resource Center
recently announced (ITRC, 2009) an increase of
reported data breaches from 446 in 2007 to 656
in 2008, an increase of 47%. Possibly one of the
largest and potentially most sensitive data breaches
came in November of 2008 when Express Scripts,
a company that manages drug benefits for health
care providers, announced that extortionists were
threatening to make public millions of health
records, complete with names and social security
numbers. Express Scripts posted a $1 million
reward for the capture of the extortionists, but as
of the writing of this chapter, the FBI has yet to
capture them (Express Scripts, 2009). Why is there
no public outcry about such situations? It could
be that efforts to inform the public about privacy
breaches have actually backfired. The more than
300 federal and state privacy-related laws have
resulted in a deluge of privacy notices (for example,
224 notices were sent to Maryland residents from
mid-2008 to mid-2009). This information overload
has perhaps caused consumers to care less about
privacy protection (Gomes, 2009). The research
in this chapter indicates that a possible solution
is proper education of the public and computer
professionals through a curriculum that emphasizes
privacy awareness.

Organizational Challenges: As described in a
Motorola White Paper (Motorola, 2007), mobile
devices are a productivity boon and an enterprise
risk. MCDs enable a mobile work force. Those
who use MCDs most are managers and sales and
service personnel, who are the people most likely
to access sensitive and proprietary data. This data
is frequently stored on the user’s MCD, which is
not very comforting because nearly 30% of
organization-owned MCDs are lost every year. The
data stored on those devices include passwords,
employee records, sensitive emails, business plans
and so on. Often the data is not encrypted. The
solution to this security problem is to adopt stringent
security standards and practices for all MCDs. At
the very least an MCD should be accessible only to
authorized users. All MCDs should be protected
by requiring a strong password, which should be
different from other passwords employed by the
user of the MCD, and which should expire after a
period of time. All data stored on the MCD should
be encrypted automatically. The organization’s
information systems staff should also have the
capability of remotely locking the device and
erasing its contents should the employee quit the
organization or should the device be lost or stolen.
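
The minimum practices listed above (a strong, unique, expiring password; automatic encryption; remote lock and erase) can be expressed as a simple compliance check. The sketch below is only an illustration under stated assumptions: the `DeviceConfig` fields, the numeric thresholds, and the sample devices are all hypothetical and do not come from the Motorola paper or from any real device-management product.

```python
from dataclasses import dataclass


@dataclass
class DeviceConfig:
    """Hypothetical snapshot of one MCD's security settings."""
    owner: str
    password_length: int
    password_age_days: int
    password_reused_elsewhere: bool
    storage_encrypted: bool
    remote_wipe_enabled: bool


# Thresholds are assumptions for this sketch, not values from the chapter.
MIN_PASSWORD_LENGTH = 12
MAX_PASSWORD_AGE_DAYS = 90


def policy_violations(device):
    """Return a list of human-readable policy violations for one device."""
    problems = []
    if device.password_length < MIN_PASSWORD_LENGTH:
        problems.append("password too short")
    if device.password_age_days > MAX_PASSWORD_AGE_DAYS:
        problems.append("password expired")
    if device.password_reused_elsewhere:
        problems.append("password reused on another account")
    if not device.storage_encrypted:
        problems.append("stored data not encrypted")
    if not device.remote_wipe_enabled:
        problems.append("remote lock and wipe not enabled")
    return problems


# Invented example devices.
compliant = DeviceConfig("alice", 16, 30, False, True, True)
risky = DeviceConfig("bob", 6, 200, True, False, False)

print(policy_violations(compliant))
print(policy_violations(risky))
```

Returning the full list of violations, rather than a single pass or fail flag, lets an information systems staff see which of the practices a given device is missing.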

The research described in this chapter has
also revealed a number of challenges for organizations
that use data obtained from people using
MCDs. Organizations must earn the confidence
of consumers on how their data is obtained and
used. A recent Harris Poll (HarrisInteractive,
2008) showed that 59% of adults polled were
not comfortable with Web sites using a person’s
online activity to tailor advertisements based on
the person’s interests. Clearly, online companies
need to earn the confidence of consumers when
using consumer data. Organizations must make
clearer what data will be collected, how it will
be used, how long it will be stored, and with whom
the data will be shared. This needs to be done by
making their privacy policies easily available and
writing them in easy-to-understand language. The
language used must avoid imprecise words that
obfuscate the meaning of the privacy policies
(Pollach, 2007).

It is also almost impossible for Web surfers to 
avoid having their data tracked and collected by 
Web sites. Almost all Web sites have only an opt- 
out option for data that they collect on the site’s 
visitors. However, Jon Leibowitz, the chairman 
of the Federal Trade Commission, which oversees 
matters of online privacy and data collection, has 
warned online ad companies that he would like 
to see them obtain explicit permission from Web 
surfers to track them online (Davis, 2009). This 
is one of the few times that a high-ranking federal 
officer has expressly lobbied for an opt-in policy. 
It would, therefore, behoove the online ad industry 
to try to develop workable, understandable, and 
effective privacy policies before the federal gov- 
ernment decides on regulations that might limit 
the effectiveness of these organizations. 

Finally, the use of MCDs can have long-term
and unintended effects on our society. The use
of MCDs as an educational tool has been written
about extensively. Students can access the many
text, image, audio and video resources of the Web
to enrich their classroom learning and textbooks.
Nearly every school in the United States has many
computers with Internet access. Some are “laptop
schools”, requiring all students to bring a laptop to
every class. (This is a trend that will expand greatly
with the advent of sub-$400 “netbooks”.) Some
are now experimenting with handheld devices
for student Internet access. However, there is a
downside to the use of MCDs among elementary
and high school children. As reported in the New
York Times (Kafner, 2009), many experts are
concerned about the use of texting by teenagers.
In the fourth quarter of 2008, the average teenager
sent or received 80 text messages
a day. Assuming a normal teen is awake for 16
hours each day, this means the average teenager
sends or receives 5 text messages each hour, or
one every 12 minutes, even during class time!
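
The arithmetic behind these rates can be checked in a few lines. The two inputs are the figures given in the text; the variable names are illustrative only.

```python
MESSAGES_PER_DAY = 80  # daily average cited in the text
WAKING_HOURS = 16      # waking hours assumed in the text

# Messages per waking hour, then the gap between messages in minutes.
messages_per_hour = MESSAGES_PER_DAY / WAKING_HOURS
minutes_between_messages = 60 / messages_per_hour

print(messages_per_hour, minutes_between_messages)
```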

Psychologists are concerned that excessive 
use of texting means possible loss of sleep, and 
a possible shift in the way adolescents develop. 
Because keeping in touch with parents is made 
easy by texting, adolescents may not be separating 
from their parents sufficiently for their develop- 
ment into mature, autonomous adults. Excessive 
texting also makes it difficult for teens to enjoy 
moments of solitude and to have time to reflect 
because they are constantly hearing their phone 
announce the arrival of still another text message. 
Will excessive texting have a long-term effect on 
teens? This situation also has implications for the 
workforce of the future. Will organizations be able 
to recruit enough mature, independent individu- 
als who can think creatively and independently? 

In 2003 the Computing Research Associa- 
tion declared that to “enable trusted systems for 
important societal applications” is one of the 
four grand challenges of trustworthy computing 
(CRA, 2003). To achieve this goal, the authors of 
this chapter believe that privacy must be included 
as an integral component in the design stage of 
software. This is not, however, an easy task. The 
notion of privacy is very complex and involves 
a combination of personally, culturally, and 
socially constructed properties of relationships 
and groups (O’Sullivan, 2000). Thus, software 
developers need to move away from the notion 
that privacy can be accomplished by a small, dis- 
crete set of software settings and move towards 
a privacy calculus that controls the development 
of boundaries and the disclosure of confidences 
(Petronio, 2002). A first step in such a paradigm 
shift is the proper privacy education of software 
professionals.


KEY TERMS AND DEFINITIONS


Curriculum Design: Methodology of analyzing
current data mining and general curricula for
expansion and integration of evolving govern- 
mental and organizational practices of mining 
on mobile computing devices that might impact 
if not intrude on privacy and security; 

Data Mining: Method of evaluating data pat- 
terns of interactions and transactions on mobile 
computing devices for future marketing of more 
customized and personalized products and ser- 
vices and for further monitoring of movements 
of citizens, consumers and employees; 

Location-Based Privacy: Method of defining 
personal and professional and otherwise private 
protected interactions and transactions on mobile 
computing devices that might be integrated into a 
broader definition of privacy covered in curricula; 

Location-Based Security: Methodology of 
defining and enabling protective measures and 
routines on mobile computing devices that might 
be integrated into a broader definition of regula- 
tion and security covered in curricula; 

Location-Based Services: Method of facilitat- 
ing feature functionality of products, services and 
tools on mobile computing devices consistent with 
the evolution of pervasive technology; 

Mobile Computing: Method of furnishing 
products, services and tools through increasingly 
pervasive and portable technology; and 

Radio Frequency Identification Devices 
(RFID): Method of enabling immediate location- 
based services. 
APPENDIX 

Framework of Syllabi 

Location-Based Privacy with Mobile Computing 

Module 1: Architecture and Applications of Mobile Computing 


Bluetooth 

Global Positioning Systems (GPS) 

Radio Frequency Identification Tags (RFID) 

Short Messaging Services (SMS) 

Wireless Application Protocols (WAP), Broadband (WiMax) and Local Area Networks 


Module 2: Design and Development of Mobile Computing Applications 


Graphical User Interface (GUI) 
Java 2 Micro Edition (J2ME) 
Multimedia 

Palm Operating System (OS) 
Symbian Operating System (OS) 
Windows CE 

Voice over Internet Protocol (VoIP) 


Module 3: Privacy of Mobile Computing Applications — Enhancement to Syllabi 
Citizen and Consumer Constructs 


Definitions of Privacy 
Functions of Privacy 


Ethical Constructs 


Ethics Management 

Ethics of Profiling 

Ethical Use and Mining of Consumer Data 
Integrity Management 

Levels of an Ethical Organization 


Governmental Constructs — United States 
United States Constitution 

Court Decisions 

Federal Legislation 


State Legislation 
Governmental Constructs — European Union 


European Commission Directives 
Member Nation Legislation 


Methodological Constructs 


Chief Privacy Officers (CPO) 

Digital Identity, Identity Layers, Liability and Rights Management 
Human Factor Failures 

Platform for Privacy Preferences 

Pretty Good Privacy (PGP) 

Privacy Organization Standards 

Privacy Policies 


Technological Constructs 


Privacy Aware Technologies (PAT) 
Privacy Invasive Technologies 
Privacy Software Technologies 


Module 4: Security of Mobile Computing Architecture 
and Applications — Enhancement to Syllabi 


Chief Security Officers (CSO) 
Information Protection and Security 


Authorization 

Availability 

Confidentiality 

Integrity 

Non-Repudiation 

Public Key Infrastructure (PKI) 


Security Protocols 


Secured Socket Layers (SSL) 

Transport Layer Security (TLS) 

Wireless Transport Layer Security (WTLS) 
Multifactor Security 

Digital Watermark 

Key Recovery 

Smartcard Security 
Mutual and Spatial Authentication 
RFID Security 
Mobile Agent Security 


Security Techniques 


Ciphering 

Cryptography 

Hashing Algorithms 

Security Policies 

Solutions and Threats to Security and Trust 


Module 5: Mobile Computing Societal and Technological Trends 


“Big Brother” 

Biometrics 

e-Passports 

Loyalty and Travel Cards 

National Identity Cards 

Reality Mining 

Privacy and Surveillance in Era of Terrorism 


Reference Research Sites for Syllabi 


www.bentley.edu/research 

www.bsr.org — Business for Social Responsibility 

www.cdt.org — Center for Democracy and Technology 

www.corpwatch.org — Watchdog on the Web 

www.depaul.edu/ethics - Institute for Business and Professional Ethics 

www.ebnsc.org — Corporate Social Responsibility in Europe 

www.epic.org — Electronic Privacy Information Center 

www.esrc.ac.uk — Economic and Social Research Council in United Kingdom 

www.ethics.org — Ethics Resource Center 

www.ietf.org — The Internet Engineering Task Force 

www.oecd.org — Organization for Economic Cooperation and Development 

www.ponemon.org — Ponemon Institute LLC 

www.privacyconference.co.uk — International Data Protection and Privacy Commissioners — United 
Kingdom 

www.privacyinternational.com — Privacy International 

www.privacyjournal.com — Privacy Journal 

www.rfid-world.com — RFID World 

www.w3.org/p3p/ - The Platform for Privacy Preferences 

www.worldcsr.com — World Social Responsibility 
Privacy Preserving Data Mining: 
ABSTRACT 


Since its inception in 2000, privacy preserving data mining has gained increasing popularity in the data 
mining research community. This line of research can be primarily attributed to the growing concern of 
individuals, organizations and the government regarding the violation of privacy in the mining of their 
data by the existing data mining technology. As a result, a whole new body of research was introduced 
to allow for the mining of data, while at the same time prohibiting the leakage of any private and sensi- 
tive information. In this chapter, the authors introduce the readers to the field of privacy preserving data 
mining; they discuss the reasons that led to its inception, the most prominent research directions, as well 
as some important methodologies per direction. Following that, the authors focus their attention on very 
recently investigated methodologies for the offering of privacy during the mining of user mobility data. At 
the end of the chapter, they provide a roadmap along with potential future research directions both with 
respect to the field of privacy-aware mobility data mining and to privacy preserving data mining at large. 


INTRODUCTION 


The significant advances in data collection and data 
storage technologies have provided the means for 
the inexpensive storage of enormous amounts of 
data in data warehouses that reside in companies 
and public organizations. Despite the benefit of using 
this data per se (e.g. for maintaining up-to-date 
profiles of the customers and records of their recent 
or historical purchases, maintaining an inventory 
of the available products, as well as their quantities 
and price, etc), the mining of these datasets with 
the existing data mining tools can reveal invaluable 
knowledge that was unknown to the data holder 
beforehand. 

The extracted knowledge patterns can provide 
insight to the data holders and at the same time can 
be invaluable in tasks such as decision making and 
strategic planning. Moreover, private companies 
are often willing to collaborate with other entities 
who conduct similar business, towards the mutual 
benefit of their businesses. Significant knowledge 
patterns can be derived and shared among the 
collaborative partners with respect to the collec- 
tive mining of their datasets. Furthermore, public 
sector organizations and civilian federal agencies 
usually have to share a portion of their collected 
data or knowledge with other organizations hav- 
ing a similar purpose, or even make this data and 
knowledge public. For example, the National 
Institutes of Health (NIH) endorses research that 
leads to significant findings which improve hu- 
man health and provides a set of guidelines which 
sanction the sharing of NIH-supported research 
findings with research institutions. 

Evidently, there exists an extended 
set of application scenarios in which information or 
knowledge derived from the data has to be shared 
with other (possibly untrusted) entities. Public 
agencies for example collect data for different 
purposes like population surveys, epidemiologi- 
cal and clinical studies, as well as various other 
social and economic experiments, to address a 
variety of problems that concern society as 
a whole. The sharing of data and/or knowledge 
may come at a cost to privacy, primarily due to 
two reasons: (a) if the data refers to individuals 
(e.g. as in customers’ market basket data, medi- 
cal data, preferences data and the like) then the 
disclosure of this data or any knowledge extracted 
from the data can potentially violate the privacy 
of the individuals if their identity is revealed to 
untrusted third parties, and (b) if the data regards 
business (or organizational) information, then the 
disclosure of this data or any knowledge extracted 
from the data may potentially reveal sensitive 
trade secrets, whose knowledge can provide a 
significant advantage to business competitors and 
thus can cause the data holder to lose business 
to his/her peers. The aforementioned privacy 
issues in the course of data mining are amplified 
due to the fact that untrusted entities (adversaries 
and data terrorists) may utilize other external and 
publicly available sources of information (e.g. 
the yellow pages, public reports) in conjunction 
with the released data or knowledge, in order to 
reveal even more protected sensitive information. 


BACKGROUND 


Since the pioneering work of Agrawal & Srikant 
(2000) and Lindell & Pinkas (2000), several ap- 
proaches have been proposed for the offering of 
privacy in data mining. Most existing approaches 
can be classified along two broad categories: (a) 
methodologies that protect the sensitive data itself 
in the mining process, and (b) methodologies that 
protect the sensitive data mining results (i.e. the ex- 
tracted knowledge patterns) that were produced by 
the application of data mining. The first category 
refers to methodologies that apply perturbation, 
sampling, generalization/suppression, transforma- 
tion, etc. techniques to the original datasets in order 
to generate their sanitized counterparts that can 
be safely disclosed to untrusted third parties. The 
goal of this category of approaches is to enable 
the data miner to get accurate data mining results 
even when he or she is not provided with the real data. 

As part of the former category we highlight methodologies 
that have been proposed to enable a 
number of data holders to collectively mine their 
data without having to reveal their datasets to each 
other. On the other hand, the second category 
deals with distortion and blocking techniques 
that prohibit the disclosure of sensitive knowl- 
edge patterns derived through the application of 
data mining algorithms, as well as techniques for 
downgrading the effectiveness of the classifiers in 
classification tasks, such that they do not reveal 
any sensitive knowledge. 

Vaidya, Clifton & Zhu (2006), and Aggarwal & 
Yu (2008) survey the different research directions 
that were investigated over the past eight years 
along with some of the methodologies that have 
been proposed along each direction. Giannotti & 
Pedreschi (2008) elaborate on methodologies that 
have been recently proposed for the offering of 
privacy in mobility data mining. In this chapter, 
we endeavor to present some of the most prevalent 
methodologies that have been proposed in privacy 
preserving data mining, along the aforementioned 
directions. In addition we pay particular atten- 
tion to methodologies that have been recently 
proposed for the offering of privacy in the mining 
of user mobility data. At the end of the chapter, 
we provide a roadmap along with potential future 
research directions with respect to the field of 
privacy-aware mobility data mining and privacy 
preserving data mining at large. 


MAIN THRUST OF THE CHAPTER 


As we previously explained, privacy preserving 
data mining is a new research area inspired by the 
need of scientists to analyze, interrogate and utilize 
raw collected data without harming the privacy of 
the subjects contained in the data itself. We may 
also consider privacy preserving data mining as a 
descendant of the so-called disclosure control and 
statistical databases research areas whose main 
focus was the protection of information stored 
in databases about human and artificial subjects 
from positive or negative compromise as well as 
the controlled publication of vast data collections 
mostly by government agencies to third party 
entities like private organizations. 

In the sequel we give an overview of privacy 
preserving data mining approaches proposed for 
the protection of sensitive traditional forms of data 
like textual data. We have selected for presentation 
in this section techniques classified as perturbative, 
non-perturbative and secure multiparty computation. 
The second part of the main thrust is devoted to 
techniques related to protecting sensitive patterns 
from mining. In this part we focus our attention 
on two paradigms, the so-called association rule 
hiding and classification rule hiding. The third 
and last part is the cornerstone of 
our book chapter since it addresses 
state-of-the-art issues in privacy preserving data 
mining like privacy aware mobility data mining. 
The approaches presented in this part include 
data perturbation and obfuscation, 
secure multiparty computation approaches and 
sequential pattern hiding approaches. 


PROTECTING TRADITIONAL 
SENSITIVE DATA DURING MINING 


A wide range of methodologies have been pro- 
posed in the research literature to effectively shield 
the sensitive information contained in a dataset 
by producing its privacy-aware counterpart that 
can be safely released. The goal of all these pri- 
vacy preserving methodologies is to ensure that 
the distorted (also known as sanitized) dataset 
(a) properly shields all the sensitive information 
that was contained in the original dataset, (b) has 
similar properties (e.g. first/second order statistics, 
etc) to the original dataset — possibly resembling 
it to a high extent — and (c) maintains reasonably 
accurate data mining results (when compared to 
those attained when mining the original dataset) 
when mined. 

The protection of sensitive data from disclo- 
sure has been extensively studied in the context 
of microdata release, where methodologies have 
been proposed for the protection of sensitive 
information regarding individuals, which are 
recorded in a dataset. In microdata, we consider 
each record of the dataset to represent an individual 
for whom the values of a number of attributes are 
being recorded (e.g. name, date of birth, residence, 
occupation, salary, etc). Among the complete set 
of attributes, there exist some attributes that ex- 
plicitly identify the individual (e.g. name, social 
security number, etc), as well as attributes which, 
once combined together or with publicly available 
external resources, may lead to the identification 
of the individual (e.g. address, gender, age, etc). 
The first type of attributes, also known as identi- 
fiers, must be removed from the data prior to its 
publishing. On the other hand, the second type of 
attributes, also known as quasi-identifiers, have to 
be handled by the privacy preservation algorithm 
in such a way that in the sanitized dataset, the 
knowledge of their values regarding an individual 
no longer poses a threat to the disclosure 
of his or her identity. 

The existing methodologies for the protection 
of sensitive microdata can be partitioned into the 
following two categories: (a) data modification 
approaches, and (b) synthetic data generation ap- 
proaches. Willenborg & DeWaal (2001) further 
partition the data modification approaches in 
perturbative and non-perturbative, depending on 
whether they introduce false information in the 
attribute-values of the data (e.g. by the addition 
of noise based on a data distribution) or they 
operate by altering the precision of the attribute- 
values (e.g. by changing a value to an interval that 
contains it). In what follows, we provide some 
more information on each of these categories of 
approaches. 


Perturbative Approaches 


In data perturbation approaches the attribute- 
values of the original dataset are modified in a 
way that the released values are inaccurate. Several 
data perturbation approaches have been proposed 
in the research literature; the most prevalent ones 
can be partitioned under the following directions: 
(a) the addition of noise based on an underlying 
distribution (see Brand, 2002 for a detailed pre- 
sentation of the methodologies in this direction), 
(b) the use of microaggregation, in which the data 
records are partitioned into groups of either fixed 
(see Domingo-Ferrer & Torra, 2005, and Defays 
& Nanopoulos, 1993) or variable (see Domingo- 
Ferrer & Mateo-Sanz, 2002, Laszlo & Mukherjee, 
2005, and Sande, 2002) size, based on record 
similarity criteria; the average value for each 
attribute in each group is calculated and is then 
used to replace the exact attribute-value of each 
record that belongs to the group, and (c) the use 
of data swapping, in which the attribute-values of 
a set of records are exchanged (see Reiss, 1984). 
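
The three directions above can be illustrated on a single numeric attribute. The following Python sketch is a minimal illustration under our own assumptions and does not reproduce the method of any of the cited works: the function names, the Gaussian noise model, the rule that the last group absorbs leftover records, and the use of a full random shuffle for swapping are all hypothetical choices.

```python
import random
from statistics import mean


def add_noise(values, sigma, seed=0):
    """Noise addition: perturb each value with zero-mean Gaussian noise.

    The Gaussian model and the seed parameter are assumptions made
    for this illustration.
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


def microaggregate(values, k):
    """Fixed-size microaggregation on one numeric attribute.

    Records are sorted by value (the similarity criterion assumed here),
    cut into consecutive groups of k records, and each value is replaced
    by its group average. The last group absorbs any remainder so that
    no group holds fewer than k records.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need at least k records and k >= 1")
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        end = len(values) if g == n_groups - 1 else start + k
        group = order[start:end]
        average = mean(values[i] for i in group)
        for i in group:
            result[i] = average
    return result


def swap_values(values, seed=0):
    """Data swapping: exchange the attribute-values among the records.

    A full random shuffle of the column is used as the simplest case.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


salaries = [30, 52, 31, 50, 29, 51, 90]
print(add_noise(salaries, sigma=2.0))
print(microaggregate(salaries, k=3))
print(swap_values(salaries))
```

Note that microaggregation preserves the attribute total and swapping preserves the multiset of released values, whereas noise addition preserves such statistics only approximately.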


Non-Perturbative Approaches 


In non-perturbative approaches the attribute- 
values of the original data are altered in a way that 
affects the precision in which they are released 
in the sanitized dataset. The most prevalent non- 
perturbative methodologies are sampling and 
global recoding. In sampling (see Willenborg 
& DeWaal, 2001), only a portion of the original 
dataset is released. On the other hand, in global 
recoding, the exact attribute-values of the quasi- 
identifiers are replaced with less specific (more 
general) values that enable previously distinct 
data records to be combined after the sanitiza- 
tion process. 

For example, in a categorical dataset the attribute 
marital status can take the value “married” for 
one record and “divorced” for another record of 
the original dataset, while it can be substituted 
with “been married” in both records in the sanitized 
counterpart. Probably the most important 
approach that encompasses global recoding is 
K-anonymity (see Samarati, 2001 and Samarati 
& Sweeney, 1998). The K-anonymity principle 
requires that every record in the sanitized dataset 
is indistinguishable from at least K-1 other records 
with respect to a set of identifying variables that 
formulate the quasi-identifier. Several algorithms 
have been proposed to enforce K-anonymity, while 
several of its variations have been explored. Fur- 
thermore, although K-anonymity was originally 
proposed for disclosure control in the context of 
microdata, it has been successfully applied in 
different contexts and application areas, such as 
in mobility data mining (as we present later on). 
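
The K-anonymity requirement can be checked mechanically once the quasi-identifiers are fixed. The sketch below is a minimal illustration, assuming a toy dataset of our own: the record layout, the ten-year age intervals and the "never married" category are hypothetical, and only the "married"/"divorced" to "been married" mapping comes from the example in the text.

```python
from collections import Counter

# Hypothetical generalization table for the marital status attribute.
MARITAL_RECODING = {
    "married": "been married",
    "divorced": "been married",
    "single": "never married",
}


def recode_age(age, width=10):
    """Replace an exact age with the interval of the given width containing it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def globally_recode(records):
    """Apply the same generalization to the quasi-identifiers of every record."""
    return [
        {
            "age": recode_age(r["age"]),
            "marital": MARITAL_RECODING[r["marital"]],
            "diagnosis": r["diagnosis"],  # sensitive attribute, left untouched
        }
        for r in records
    ]


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())


records = [
    {"age": 34, "marital": "married", "diagnosis": "flu"},
    {"age": 37, "marital": "divorced", "diagnosis": "asthma"},
    {"age": 52, "marital": "single", "diagnosis": "flu"},
    {"age": 58, "marital": "single", "diagnosis": "diabetes"},
]
quasi = ["age", "marital"]
print(is_k_anonymous(records, quasi, k=2))
print(is_k_anonymous(globally_recode(records), quasi, k=2))
```

In this toy dataset every original record is unique on the quasi-identifier, while after recoding each record shares its quasi-identifier values with one other record.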
Secure Multiparty 
Computation Approaches 


The two previous categories of approaches aim 
at generating a sanitized dataset from the original 
one, which can be safely shared with untrusted 
third parties as it contains only non-sensitive data. 
Secure Multiparty Computation (SMC) provides 
an alternative family of approaches that can effec- 
tively protect the sensitive data. SMC considers a 
set of collaborators who wish to collectively mine 
their data but are unwilling to disclose their own 
datasets to each other. As it turns out, this distributed 
privacy preserving data mining problem can 
be reduced to the secure computation of a function 
based on distributed inputs and is thus solved by 
using cryptographic approaches. 

Pinkas (2002) elaborates on the close 
relation that exists between privacy-aware data 
mining and cryptography. In SMC, each party 
contributes to the computation of the secure 
function by providing its private input. A secure 
cryptographic protocol that is executed among the 
collaborating parties ensures that the private input 
that is contributed by each party is not disclosed 
to the others. Most of the applied cryptographic 
protocols for multi-party computation reduce to 
some primitive operations that have to be securely 
performed: secure sum, secure set union, 
and secure scalar product. Clifton, et al. (2002) 
discuss these operations. 
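
As an illustration of the simplest of these primitives, the sketch below simulates a ring-based secure sum in a single process. It is a sketch under stated assumptions, not the exact protocol of the cited work: we assume semi-honest parties that do not collude, non-negative inputs whose true sum is smaller than a publicly agreed modulus, and the function name and the returned message list are hypothetical.

```python
import secrets


def secure_sum(private_values, modulus):
    """Simulate a ring-based secure sum among len(private_values) parties.

    Party 0 masks its input with a random value drawn uniformly from
    [0, modulus) and passes the result on. Every other party adds its own
    input modulo the modulus. Party 0 finally removes the mask.
    Returns the total and the messages seen on the wire.
    """
    # This sanity check exists only in the simulation; in a real run no
    # single party could evaluate it, so the bound must be agreed upon.
    if any(v < 0 for v in private_values) or sum(private_values) >= modulus:
        raise ValueError("inputs must be non-negative and sum below modulus")

    mask = secrets.randbelow(modulus)           # known to party 0 only
    running = (mask + private_values[0]) % modulus
    messages = [running]                        # what party 1 receives
    for value in private_values[1:]:
        running = (running + value) % modulus   # each party adds its input
        messages.append(running)                # what the next party receives
    total = (running - mask) % modulus          # party 0 removes the mask
    return total, messages


total, wire = secure_sum([12, 7, 30], modulus=1000)
print(total)
```

Because the mask is uniform over the whole ring, each intermediate message is itself uniformly distributed and reveals nothing about the partial sums to the party that receives it, as long as the parties do not pool what they have seen.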

The operation of the secure protocols in the 
course of distributed privacy preserving data min- 
ing depends highly on the existing distribution 
of the data in the sites of the collaborators. Two 
types of data distribution have been investigated. 
In a horizontal data distribution, each collaborator 
holds a number of records and for each record he 
or she has knowledge of the same set of attributes 
as his/her peers. On the other hand, in a vertical 
partitioning of the data, each collaborator is aware 
of different attributes referring to the same set of 
records. Some representative SMC approaches 
that operate on horizontally partitioned datasets 
can be found in the work of Inan et al. (2006), 
Jagannathan & Wright (2005), Jagannathan et 
al. (2006), Kantarcioglu & Clifton (2004), and 
Kantarcioglu & Vaidya (2003). On the other hand, 
some SMC approaches that assume a vertical data 
distribution have been proposed by Vaidya & 
Clifton (2002), Vaidya & Clifton (2003), Vaidya 
& Clifton (2004), Vaidya & Clifton (2005), Vaidya 
& Clifton (2006), and Yu et al. (2006). SMC has 
also been studied in the context of distributed K- 
anonymity. A secure K-anonymous protocol that 
assumes a vertical data partitioning was proposed 
by Jiang & Clifton (2005), while a secure protocol 
that ensures K-anonymity in a horizontal data 
partitioning can be found in the work of Zhong, 
et al. (2005). 
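
The difference between the two types of data distribution can be made concrete with a toy dataset. The following sketch is only an illustration under our own assumptions: the record layout, the shared "id" key and the round-robin assignment of records to sites are hypothetical.

```python
def horizontal_partition(records, n_sites):
    """Each site holds some of the records, with all of their attributes."""
    return [records[i::n_sites] for i in range(n_sites)]


def vertical_partition(records, key, attribute_groups):
    """Each site holds all of the records, but only its own attributes.

    The shared key lets the sites refer to the same record without
    revealing the attributes they hold about it.
    """
    return [
        [{key: r[key], **{a: r[a] for a in attrs}} for r in records]
        for attrs in attribute_groups
    ]


records = [
    {"id": 1, "age": 34, "zip": "10001", "purchase": "milk"},
    {"id": 2, "age": 37, "zip": "10002", "purchase": "bread"},
    {"id": 3, "age": 52, "zip": "10001", "purchase": "milk"},
    {"id": 4, "age": 58, "zip": "10003", "purchase": "eggs"},
]
for site in horizontal_partition(records, n_sites=2):
    print(site)
for site in vertical_partition(records, "id", [["age", "zip"], ["purchase"]]):
    print(site)
```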


PROTECTING SENSITIVE 
PATTERNS FROM MINING 


In this section, we focus our attention on privacy 
preserving methodologies that protect the sensi- 
tive knowledge patterns that would otherwise 
be revealed after the course of mining the data. 
Similarly to the methodologies that we have 
presented so far for protecting the sensitive data 
prior to its mining, the methodologies of this cat- 
egory also modify the original dataset but in such 
a way that certain sensitive knowledge patterns 
are suppressed, when mining the data. In what 
follows, we discuss methodologies that have been 
proposed for the hiding of sensitive knowledge 
in the context of association and classification 
rule mining. 


Association Rule Hiding 


The association rule mining framework along 
with some computationally efficient heuristic 
methodologies for the production of association 
rules can be found in the work of Agrawal, et al. 
(1993) and Agrawal & Srikant (1994). Knowledge 
hiding, in the context of association rule mining, 
aims at sanitizing the original dataset such that 
(a) all the sensitive rules (as indicated by the data 
holder) that appear when mining the original 
dataset for association rules, do not appear when 
mining the sanitized dataset for association rules 
at the same (or higher) levels of support and 
confidence, (b) all the non-sensitive rules can be 
successfully mined in the sanitized dataset at the 
same (or higher) levels of support and confidence, 
and (c) no rule that was not initially found in the 
mining of the original dataset can be found in its 
sanitized counterpart, when mining the sanitized 
dataset at the same (or higher) levels of support 
and confidence. 

The first goal simply states that all the sensi- 
tive association rules are properly hidden in the 
sanitized dataset. The hiding of the sensitive 
knowledge comes at a cost to the utility of the 
sanitized outcome. The second and the third 
goals aim at minimizing this cost. Specifically, 
the second goal requires that only the sensitive 
knowledge is hidden in the sanitized dataset and 
thus no other, non-sensitive rules are lost due to 
side-effects of the sanitization process. On the 
other hand, the third goal requires that no artifacts 
(i.e. false association rules) are generated 
by the sanitization process. To recapitulate, in 
association rule hiding the sanitization process 
has to be accomplished in a way that minimally 
affects the original dataset, preserves the general 
patterns and trends of the dataset, and succeeds in 
concealing all the sensitive knowledge, as indicated 
by the data holder. 
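
The three goals can be checked by mining the original and the sanitized dataset at the same thresholds and comparing the resulting rule sets. The sketch below is a minimal illustration assuming a brute-force miner that is only practical for tiny datasets; the function names and the labels "hiding_failures", "lost_rules" and "ghost_rules" are our own hypothetical choices, and the miner assumes a strictly positive support threshold.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def mine_rules(transactions, min_sup, min_conf):
    """Brute-force miner: returns a set of (antecedent, consequent) pairs."""
    items = sorted(set().union(*(set(t) for t in transactions)))
    rules = set()
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            sup = support(candidate, transactions)
            if sup == 0 or sup < min_sup:
                continue
            for r in range(1, size):
                for lhs in combinations(candidate, r):
                    confidence = sup / support(lhs, transactions)
                    if confidence >= min_conf:
                        rhs = frozenset(candidate) - frozenset(lhs)
                        rules.add((frozenset(lhs), rhs))
    return rules


def side_effects(original_rules, sanitized_rules, sensitive_rules):
    """Measure how far a sanitized dataset is from meeting the three goals."""
    return {
        # goal (a): sensitive rules that can still be mined
        "hiding_failures": sensitive_rules & sanitized_rules,
        # goal (b): non-sensitive rules lost as a side-effect
        "lost_rules": (original_rules - sensitive_rules) - sanitized_rules,
        # goal (c): artifacts that were not present originally
        "ghost_rules": sanitized_rules - original_rules,
    }


original = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
sanitized = [{"a", "b", "c"}, {"a"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
sensitive = {(frozenset({"a"}), frozenset({"b"}))}
report = side_effects(
    mine_rules(original, 0.3, 0.7), mine_rules(sanitized, 0.3, 0.7), sensitive
)
print(report)
```

In this toy run, deleting item b from a single transaction hides the sensitive rule a → b, but it also pushes the non-sensitive rule b → a below the confidence threshold, which is exactly the kind of side-effect that the second goal tries to minimize.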

The problem of association rule hiding has 
been studied along three directions, namely (a) 
heuristic approaches, (b) border-based approaches, 
and (c) exact approaches. The first class of ap- 
proaches collects time and memory efficient 
algorithms that heuristically select a portion of 
the transactions of the original dataset to sanitize, 
in order to facilitate sensitive knowledge hiding. 
Due to their efficiency and scalability, these ap- 
proaches have been investigated by the majority 
of the researchers in the knowledge hiding field 
of privacy preserving data mining. However, as 
in all heuristic methodologies, the approaches of 
this category take locally best decisions when 
performing knowledge hiding, which may not 
always be (and usually are not) globally best. 

As a result, there are several circumstances in 
which these methodologies suffer from undesir- 
able side-effects and may not identify optimal 
hiding solutions, even if they exist. Heuristic ap- 
proaches can rely on a distortion (i.e. inclusion/ 
exclusion of items from selected transactions) or 
on a blocking (i.e. replacing some of the origi- 
nal values in a transaction with question marks) 
scheme. Some distortion-based algorithms for 
association rule hiding can be found in the work 
of Atallah, et al. (1999), Dasseni, et al. (2001), 
Oliveira & Zaiane (2003), Verykios, et al. (2004), 
and Wu, et al. (2007). Some blocking-based al- 
gorithms can be found in the work of Saygin, et 
al. (2001), Saygin, et al. (2002), Pontikakis, et al. 
(2004), and Wang & Jafar (2005). 
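
To give a flavor of a distortion-based heuristic, the sketch below hides a single rule by excluding one of its consequent items from supporting transactions until the confidence of the rule drops below the threshold. It is a generic illustration and not a reimplementation of any of the cited algorithms: the function name, the choice of which item to delete and the preference for the shortest supporting transaction are all assumptions made for this example.

```python
def hide_rule_by_distortion(transactions, antecedent, consequent, min_conf):
    """Greedy distortion: lower the confidence of antecedent -> consequent.

    Assumes antecedent and consequent are disjoint and min_conf > 0.
    Returns a sanitized copy; the input transactions are not modified.
    """
    if min_conf <= 0:
        raise ValueError("min_conf must be positive")
    data = [set(t) for t in transactions]
    lhs, rhs = set(antecedent), set(consequent)
    victim = sorted(rhs)[0]  # item to exclude; an arbitrary fixed choice

    def confidence():
        n_lhs = sum(lhs <= t for t in data)
        n_rule = sum((lhs | rhs) <= t for t in data)
        return n_rule / n_lhs if n_lhs else 0.0

    while confidence() >= min_conf:
        supporting = [t for t in data if (lhs | rhs) <= t]
        # Locally best decision: distort the shortest supporting transaction,
        # hoping that it contributes to the fewest other itemsets.
        target = min(supporting, key=len)
        target.discard(victim)
    return data


original = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
print(hide_rule_by_distortion(original, {"a"}, {"b"}, min_conf=0.7))
```

Each step is chosen only by looking at the current supporting transactions, which is why such a procedure can hide the rule while still causing avoidable side-effects elsewhere in the dataset. A blocking scheme would follow the same loop but replace the selected item with a question mark instead of deleting it.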

The second class of approaches collects meth- 
odologies that hide the sensitive knowledge by 
modifying the original borders in the lattice of 
the frequent (i.e. statistically significant) and the 
infrequent (i.e. statistically insignificant) patterns 
of the original dataset. In particular, the sensitive 
knowledge is hidden by enforcing the revised 
borders (which accommodate the hiding of the 
sensitive itemsets) in the sanitized database. The 
algorithms in this class differ in the borders they 
track, as well as in the methodology that they ap- 
ply to enforce the revised borders in the sanitized 
dataset. An analysis regarding the use of borders in 
association rule mining can be found in the work of 
Mannila and Toivonen (1997). Two border-based 
approaches to association rule hiding can be found 
in the work of Sun & Yu (2005) and Moustakides 
& Verykios (2006). 

Finally, the third class of approaches in- 
volves non-heuristic algorithms which conceive 
the knowledge hiding process as a constraints 
satisfaction problem (an optimization problem) 
that is solved through the application of integer 
or linear programming. This class of approaches 
differs from the previous two, primarily due to 
the fact that it collects methodologies that 
can guarantee optimality in the hiding solution 
(provided that an optimal hiding solution exists). 
On the negative side, these approaches are usu- 
ally several orders of magnitude slower than the 
heuristic ones, especially due to the runtime that is 
required for the solution of the constraints satisfac- 
tion problem by the integer/linear programming 
solver. Menon, et al. (2005) propose an approach 
that combines a constraints satisfaction problem 
with a heuristic algorithm for data sanitization to 
improve the quality of the hiding solution. The 
proposed approach, although interesting, may 
lead to suboptimal solutions even if optimal ones 
exist. A family of more advanced methodologies 
that guarantee optimality of the hiding solution 
can be found in the work of Gkoulalas-Divanis & 
Verykios (2006), Gkoulalas-Divanis & Verykios 
(2008), and Gkoulalas-Divanis & Verykios (2009). 


Classification Rule Hiding 


Privacy-aware classification has been studied to a 
substantially lower extent than privacy preserving 
association rule mining. Similarly to association 
rule hiding, classification rule hiding algorithms 
consider a set of classification rules as sensitive 
and proceed to protect them from disclosure by 
using either suppression-based or reconstruction- 
based techniques. In suppression-based techniques 
the confidence of a classification rule (measured 
in terms of the owner’s belief regarding the hold- 
ing of the rule when given the data) is reduced 
by distorting a set of attributes in the dataset that 
belong to transactions related to its existence. 
Some approaches that fall under this category are 
proposed by Chang & Moskowitz (1998), Clifton 
(2000), Johnsten & Raghavan (2000), Chen & Liu 
(2005), and Wang, et al. (2005). 

A system that is based on the former category 
of approaches was proposed by Moskowitz & 
Chang (2000). On the other hand, reconstruction-based 
approaches aim at reconstructing the 
dataset by using only those transactions of the 
original dataset that support the non-sensitive 
classification rules. The works of Natwichai, et 
al. (2005) and Natwichai, et al. (2006) fit in this 
category of approaches. Katsarou, et al. (2009) 
propose an intermediate approach that performs 
reconstruction-based classification rule hiding 
through controlled data modification. 


PRIVACY AWARE MOBILITY 
DATA MINING 


The remarkable advances in telecommunications 
and in location tracking technologies, such as 
GPS, GSM and UMTS, have made possible the 
tracking of mobile devices (and thus their human 
companions) at an accuracy of a few meters, at an 
affordable cost. From this perspective, we have 
nowadays the means of collecting, storing and pro- 
cessing mobility data of unprecedented quantity, 
quality and timeliness. The movement traces, left 
by the mobile devices of the users, are an excel- 
lent source of information that can aid towards 
decision making in mobility-related issues, such 
as urban planning, traffic analysis, forecasting of 
traffic-related phenomena, and timely detection 
of problems that emerge from users’ movement 
behavior. On the other hand, it becomes evident 
that in the wrong hands this type of emergent 
knowledge may lead to an abuse scenario, as the 
mobility data may reveal highly sensitive personal 
information. Some examples of misuse include, 
but are not limited to, user tailing, surveillance 
or even unsolicited advertising. 

As a result, in the last few years, a set of privacy 
preserving methodologies have been proposed for 
the protection of sensitive data and/or knowledge 
related to user mobility. The methodologies proposed 
so far can be partitioned into two broad categories: 
(a) methodologies that protect the sensitive 
data related to user mobility prior to the course 
of data mining, and (b) methodologies that hide 
sensitive knowledge patterns that summarize user 
mobility, which are identified as a result of the 
application of data mining. The first category of 
approaches collects data perturbation and obfusca- 
tion methodologies that distort the original dataset 
to facilitate privacy-aware data publication, as 
well as distributed privacy-aware methodologies 
for secure multiparty computation. On the other 
hand, the second category of approaches treats 
the mobility data as sequential data and applies 
a sequential pattern hiding strategy to prevent the 
disclosure of the sensitive sequential patterns in 
the course of sequential pattern mining. After the 
application of these approaches, only the non-sensitive
patterns, summarizing users’ movement
behavior, survive the mining process, while the
sensitive ones are suppressed in the data mining
result. In what follows, we present in detail some
of the approaches that have been proposed along
each of three directions: data perturbation and
obfuscation, secure multiparty computation, and
sequential pattern hiding.


Data Perturbation and Obfuscation 


Data perturbation and obfuscation approaches 
aim at sanitizing a dataset containing user mo- 
bility data, in such a way that an adversary can 
no longer match the recorded movement of each 
user to a particular individual (and thus reveal the
identity of the user based on his or her recorded 
movement in the sanitized dataset). In what fol- 
lows, we consider that user mobility is captured 
as a set of trajectories (one per user) that depict 
the locations and times in the course of his or her 
history of movement. We assume that these loca- 
tion/time recordings occur at a reasonably high 
rate that allows the tracking of user movement in 
the original dataset. For example, an adversary 
could use these recordings to track the user down 
to his/her house or place of work, even if the user 
trajectory was not accompanied by an explicit user
identifier, such as the user id, the social security 
number or even the name of the user. 

Hoh & Gruteser (2005) present a data per- 
turbation algorithm that is based on the idea of 
path crossing. The proposed approach identifies
when two nonintersecting trajectories that belong 
to different users are reasonably close to each 
other and generates a fake crossing of these two 
trajectories in the sanitized dataset. The goal of 
this approach is to prevent an adversary from 
successfully tracking a complete user trajectory 
in the sanitized dataset, and thus identifying the 
corresponding user. Provided that many crossings
of trajectories exist in the sanitized dataset, the
probability that an adversary succeeds in following
the same individual before and after a crossing
of this user’s trajectory with one or more other
trajectories in the dataset decreases substantially.
As the authors demonstrate, path confusion can be
formalized as a constrained nonlinear optimization
problem: given the trajectories of two users
within a bounded area where a crossing has to
occur, its solution yields the perturbed locations for
each user such that their trajectories meet within
a pre-specified time period.

Moreover, at each generated fake user location
towards the meeting of the two trajectories, the
algorithm takes special care to keep the enforced 
perturbation of the exact user location within 
reasonable bounds. In order to achieve this, each 
perturbed (fake) location has to reside within a 
given perturbation radius (indicating the maximum 
allowable perturbation) from the original user 
location. As is expected, a larger radius increases 
the degree of privacy that is offered to the users 
but also deteriorates the utility of the sanitized 
dataset. Equivalently, a smaller radius offers less 
privacy to the users but achieves a better utility 
of the publicized data. Through experiments the 
authors show that the proposed algorithm limits
the duration in which an adversary can successfully 
track the same individual in the sanitized dataset. 
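
The perturbation radius constraint can be illustrated with a minimal sketch. It is not the optimization procedure of Hoh & Gruteser (2005); it only shows how a candidate fake location could be kept within the maximum allowable perturbation. We assume planar coordinates in meters, and the function name `clamp_to_radius` is hypothetical.

```python
import math


def clamp_to_radius(original, proposed, radius):
    """Return `proposed` if it lies within `radius` of `original`;
    otherwise return the closest point on the boundary of that disc.

    Assumption: locations are (x, y) tuples in a planar coordinate
    system (e.g. meters), not latitude/longitude.
    """
    dx = proposed[0] - original[0]
    dy = proposed[1] - original[1]
    dist = math.hypot(dx, dy)
    if dist <= radius:
        return proposed
    # Shrink the displacement so that its length equals the radius.
    scale = radius / dist
    return (original[0] + dx * scale, original[1] + dy * scale)


# A fake location 5 m away is pulled back to 2.5 m from the original.
print(clamp_to_radius((0.0, 0.0), (3.0, 4.0), 2.5))
```

A larger `radius` leaves the perturbation more room to create crossings (more privacy), at the price of locations that deviate further from the truth (less utility).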

Hoh, et al. (2007) introduce a new empirical
measure for the quantification of privacy in a set
of publicized location/time recordings. The proposed
measure captures how long a user can
be successfully tracked (by an adversary) based
on the knowledge of his/her user trajectory: it is
the time that elapses
between two consecutive occasions where the
adversary could not determine (at least with sufficient
certainty) the next location/time recording
in the trajectory of the user. By using this measure,
the authors propose a path perturbation strategy 
that relies on data coarsening to exclude a limited 
amount of location/time recordings from a user 
trajectory. The applied coarsening strategy ensures 
that the corresponding user cannot be tracked (at 
least with sufficient certainty) by the adversary, for 
a time that exceeds a pre-specified time threshold
(i.e. the maximum time to confusion). To achieve 
this goal, the perturbation algorithm discloses a 
location/time recording of a user trajectory (as it 
appears in the original dataset) to the sanitized 
dataset, only if the time that has passed since the 
last point of confusion is below the pre-specified 
time threshold. 
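
The disclosure rule described above can be sketched as follows. This is a simplified illustration and not the algorithm of Hoh, et al. (2007): the adversary model is abstracted into a caller-supplied predicate `is_confusion_point`, which is a hypothetical placeholder, and we assume that the start of a trajectory counts as a point of confusion.

```python
def release_with_time_to_confusion(samples, is_confusion_point,
                                   max_time_to_confusion):
    """Release a (timestamp, location) sample only while the time elapsed
    since the last point of confusion is below the threshold.

    samples: list of (timestamp, location), sorted by timestamp.
    is_confusion_point: hypothetical callable (timestamp, location) -> bool
        that returns True when the adversary could not determine the next
        recording with sufficient certainty.
    """
    released = []
    last_confusion = None
    for timestamp, location in samples:
        if last_confusion is None:
            # Assumption: tracking starts at the first sample.
            last_confusion = timestamp
        if is_confusion_point(timestamp, location):
            last_confusion = timestamp
        if timestamp - last_confusion < max_time_to_confusion:
            released.append((timestamp, location))
    return released


samples = [(0, "p0"), (10, "p1"), (20, "p2"), (30, "p3"), (40, "p4")]
kept = release_with_time_to_confusion(samples, lambda t, loc: t == 30, 15)
print([t for t, _ in kept])  # [0, 10, 30, 40]
```

The sample at time 20 is withheld because 20 time units have passed since the last point of confusion, which exceeds the threshold of 15.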

Terrovitis & Mamoulis (2008) consider da- 
tasets that depict user mobility in the form of 
sequences of places that each user has visited in 
the course of his/her movement. For each user, 
the authors assume the existence of a transaction 
in the dataset that contains the list of places that 
this particular user has visited (e.g. based on his/ 
her card transactions), set out in the order of visit. 
No other information of spatial or temporal na- 
ture (e.g. the exact time of each visit) is assumed 
to be provided. Based on this type of data, the 
authors propose a suppression technique that 
removes some of the places that were visited by 
specific users, in order to protect their identity 
from adversaries who hold partial information 
on the user trajectories. Specifically, an adver- 
sary is considered to be any individual who has 
knowledge of certain places that were visited by 
particular users, for whom he/she knows their 
identity. To exemplify, consider a bank which has 
many branches in a city. Each branch of the bank 
has some ATM machines that people can use to 
perform regular money transactions. Whenever a 
person uses the ATM of the bank this transaction 
is recorded. 


Now assume that the bank manager has possession
of the original dataset of user mobility, where
he/she identifies that some of the users that appear 
in the dataset have visited certain branches of the 
bank. By using this information it is possible that 
the manager can figure out the identity of some 
of the users who are recorded in the dataset and 
then learn the other places that they have visited 
during their movement. To protect the privacy of 
the users when publicizing their movement data, 
the proposed methodology assumes that the data 
holder has knowledge of the sets of places (i.e. 
the projection of the dataset) that are known to 
each individual adversary. In our example, the data 
holder knows the branches of the bank that the bank 
manager controls. By using this information, the 
data holder can compute the probability by which 
the corresponding adversary can infer the identity 
of a user in the publicized dataset, based on the 
projection of the data that the adversary holds. 
The proposed suppression strategy operates in an 
iterative fashion to minimize the probability that a
given adversary associates (based on his/her data
projection) a place that appears in the publicized
data with the identity of a particular person. (Some
other interesting approaches in this direction are
proposed in the work of Pensa, et al. (2008), Nergiz,
et al. (2008), and Abul, et al. (2008).)
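
To make the notion of an adversary's projection concrete, consider the following minimal sketch. It uses a simplified model that we assume for illustration only: the adversary observes the subsequence of a user's visits that fall into the places he/she controls, and every published trajectory with the same projection is an equally likely candidate. The function names are hypothetical, and the sketch is not the suppression algorithm of Terrovitis & Mamoulis (2008).

```python
def project(trajectory, known_places):
    """Subsequence of the trajectory restricted to the adversary's places."""
    return tuple(place for place in trajectory if place in known_places)


def identification_probability(published, known_places, adversary_view):
    """Probability of linking a known user to one published trajectory.

    Assumption: all published trajectories whose projection equals the
    adversary's view are equally likely to belong to the user.
    """
    view = tuple(adversary_view)
    candidates = [t for t in published if project(t, known_places) == view]
    return 1.0 / len(candidates) if candidates else 0.0


published = [
    ("branch1", "cinema", "branch2"),
    ("branch1", "hospital", "branch2"),
    ("branch1", "gym"),
]
bank_places = {"branch1", "branch2"}
print(identification_probability(published, bank_places,
                                 ("branch1", "branch2")))  # 0.5
print(identification_probability(published, bank_places,
                                 ("branch1",)))            # 1.0
```

In the second call the user is identified with certainty, so the adversary also learns the visit to the gym. A suppression strategy would remove places from the published trajectories until such probabilities drop to an acceptable level.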


Secure Multiparty Computation 


Secure multiparty computation has also been 
studied in the context of user mobility (and 
more generally, spatiotemporal) data. Inan &
Saygin (2006) were the first authors to propose a 
privacy-aware methodology that clusters a set of 
spatiotemporal datasets, owned by different par- 
ties. To perform clustering, a similarity measure 
is necessary in order to quantify the proximity 
between two objects (e.g. the user trajectories), 
such that in the computed clustering solution, 
the co-clustered objects are more similar to one 
another than to objects belonging to different
clusters. As part of this work, the authors propose 
a secure protocol that can be employed to enable
the pairwise secure computation of trajectory 
similarity among all the trajectories of the different 
parties, thus building a global matrix of trajectory 
similarity. By using this matrix, a trusted third 
party can perform the clustering on behalf of the 
users and communicate the computed clustering 
results back to the collaborating parties. The pro- 
posed privacy preserving protocol supports all the 
necessary basic operations for the computation 
of trajectory similarity based on widely adopted 
trajectory comparison functions: (a) Euclidean 
distance, (b) longest common subsequence, (c) 
dynamic time warping, and (d) edit distance. 
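
For reference, the following sketch computes one of the listed comparison functions, dynamic time warping, in the clear and assembles a global similarity matrix from it. It deliberately omits the cryptographic machinery of the secure protocol: in the setting of Inan & Saygin (2006) no single party would see all trajectories as this code does. Trajectories are assumed to be lists of planar (x, y) points, and the function names are hypothetical.

```python
import math


def dtw_distance(traj_a, traj_b):
    """Classic dynamic time warping distance between two point sequences."""
    n, m = len(traj_a), len(traj_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(traj_a[i - 1], traj_b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]


def similarity_matrix(trajectories):
    """Symmetric matrix of pairwise DTW distances (plaintext version)."""
    size = len(trajectories)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            d = dtw_distance(trajectories[i], trajectories[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


party_one = [[(0, 0), (1, 0), (2, 0)]]
party_two = [[(0, 0), (0, 0), (1, 0), (2, 0)], [(0, 3), (1, 3), (2, 3)]]
print(similarity_matrix(party_one + party_two))
```

Any clustering algorithm that accepts a precomputed distance matrix can then be run on the result, which mirrors the freedom of choice that the trusted party enjoys.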

The protocol makes the following assump- 
tions: (a) it assumes a semi-honest model in 
which all the parties follow the protocol but may 
also store any information that they receive from 
other parties in order to infer private data, (b) the 
parties do not mutually share any other kind of 
information, and (c) the mobility data that is to 
be clustered follows a horizontal partitioning. The 
proposed methodology operates as follows: (a) 
every involved party, including the trusted party, 
generates pairwise keys which are used to disguise 
the exchanged messages, (b) each party locally 
computes the trajectory similarity matrix (based 
on the commonly accepted trajectory comparison 
function) for its own trajectories and securely 
transmits it to the trusted party, (c) for every pair 
of trajectories that belong to the datasets of differ- 
ent parties, the two parties execute the protocol to 
compute the similarity of their trajectories, build 
a similarity matrix based on their trajectories, and
subsequently transmit it to the trusted party, and 
(d) the trusted party uses the computed matrix of 
trajectory similarity based on the trajectories of 
all the collaborating parties, in order to perform 
trajectory clustering. An interesting observation 
is that by using this technique, the trusted party is 
free to choose any clustering algorithm, depending 
on the requirements of the data holders, in order 
to perform the clustering of the trajectories. 


Sequential Pattern Hiding


The extraction of frequent patterns from mobility 
data has primarily concentrated on the sequential 
nature of such datasets by extracting frequent 
subsequences of user mobility (e.g. Cao, et al. 
2005, Giannotti, et al. 2006). Giannotti, et al. 
(2007) proposed the integration of spatial and 
temporal information in the extracted mobility 
patterns by temporally annotating the extracted 
sequences, depicting frequent movement, with 
the transition times from one element (place of 
interest) to another. In a similar manner, the ap- 
proaches that have been proposed for the hiding 
of frequent mobility patterns consider knowledge 
hiding in the form of sequential pattern hiding. In 
what follows, we present an approach that belongs 
in this category. 

Abul, et al. (2007a) model the problem of trajectory
hiding as that of sequential pattern hiding.
The authors associate with every sensitive
sequence a disclosure threshold that defines
the maximum number of sequences in the sanitized 
database that are allowed to support the sensitive 
sequence. The sequence sanitization operation is 
based on the use of unknowns to mask selected 
elements in the sequences of the original dataset. 
As the authors prove, the problem of sanitizing a 
sequence from the original dataset, while introduc- 
ing the least amount of unknowns, is NP-hard and 
thus one needs to resort to heuristics to identify 
an efficient solution. The proposed heuristic op- 
erates as follows: For each sensitive sequence, 
the algorithm searches all the sequences of the 
original database to identify those in which the 
sensitive sequence is a subsequence (a sequence
S1 is a subsequence of another sequence S2 if it can
be obtained by deleting some elements from S2).
For every such sequence of the original dataset,
the algorithm examines in how many different
ways the sensitive sequence occurs as a subsequence
of it. Each “different way” (also called
a matching) is counted based on the positions of
the elements in the sequence that participate in
the generation of the sensitive sequence.

As a result, for each element of the
sequence coming from the original dataset, the 
algorithm maintains a counter depicting the 
number of matchings in which it is involved. To 
sanitize the sequence, the algorithm iteratively 
identifies the element of the sequence which has 
the highest counter (i.e. it is involved in most 
matchings) and replaces it by an unknown, until 
the sensitive sequence is no longer a subsequence 
of the sanitized one. As a result of this operation, 
the sensitive sequence becomes unsupported by 
the sanitized sequence. In order to enforce the requested
disclosure threshold, the algorithm applies
this sanitization operation in the following manner: 
For each sensitive sequence, all the sequences of 
the original dataset are sorted in ascending order 
based on the number of different matchings that 
they have with the sensitive sequence. Then, the 
algorithm sanitizes the sequences in this order, 
until the required disclosure threshold is met in 
the privacy-aware version of the original dataset. 
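
The heuristic described above can be sketched in a few lines. The code is our own illustrative reading of the textual description and not the authors' implementation: matchings are enumerated exhaustively (which is exponential in the worst case), ties between equally loaded elements are broken in favor of the leftmost one, and the symbol `?` is assumed not to occur in the data. All names are hypothetical.

```python
UNKNOWN = "?"  # assumption: this mask symbol never appears in the data


def matchings(sensitive, sequence):
    """All index tuples at which `sensitive` occurs as a subsequence."""
    found = []

    def extend(pos, start, chosen):
        if pos == len(sensitive):
            found.append(tuple(chosen))
            return
        for i in range(start, len(sequence)):
            if sequence[i] == sensitive[pos]:
                extend(pos + 1, i + 1, chosen + [i])

    extend(0, 0, [])
    return found


def sanitize_sequence(sensitive, sequence):
    """Mask the most involved element until no matching remains."""
    seq = list(sequence)
    while True:
        found = matchings(sensitive, seq)
        if not found:
            return seq
        counts = {}
        for matching in found:
            for index in matching:
                counts[index] = counts.get(index, 0) + 1
        # Highest counter first; the leftmost index wins ties.
        victim = max(counts, key=lambda i: (counts[i], -i))
        seq[victim] = UNKNOWN


def hide_pattern(sensitive, database, threshold):
    """Sanitize supporters in ascending order of matchings until at most
    `threshold` sequences still support the sensitive sequence."""
    db = [list(s) for s in database]
    supporters = sorted(
        (len(matchings(sensitive, s)), idx)
        for idx, s in enumerate(db)
        if matchings(sensitive, s)
    )
    for _, idx in supporters[:max(0, len(supporters) - threshold)]:
        db[idx] = sanitize_sequence(sensitive, db[idx])
    return db


database = [["a", "b"], ["a", "a", "b"], ["c", "d"],
            ["a", "b", "b", "a", "b"]]
print(hide_pattern(["a", "b"], database, threshold=1))
```

With a threshold of 1, the two supporters with the fewest matchings are sanitized, and only the last sequence continues to support the sensitive pattern.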

A similar approach to that of Abul, et al. 
(2007a), which operates by removing (instead 
of masking) elements from sequential mobility 
patterns and assumes an underlying network of 
user movement from which those patterns were 
extracted, is presented in the work of Abul, et al. 
(2007b). 


FUTURE TRENDS 


Data mining is a rapidly evolving field counting 
numerous conferences, journals and books that 
are dedicated to this area of research. As new 
forms of data come into existence, as well as new 
application areas and challenges arise, it becomes 
evident that innovative privacy preserving data 
mining methodologies will also have to be pro- 
posed to keep pace with this progress. The current 
applications of privacy preserving data mining are 
numerous, spanning from the offering of privacy
in the release of medical and genomic databases to
the extraction of knowledge patterns that provide 
information related to homeland security. Mobility 
data mining, as well as privacy-aware stream data 
mining are among the most recent and prominent 
directions of privacy preserving data mining. 
As spatiotemporal and geo-referenced datasets 
grow, a novel class of applications is expected 
to appear that will be based on the extraction 
of behavioral patterns of user mobility. Clearly, 
in these applications privacy is a major concern 
and thus novel privacy preserving methodologies 
will have to be proposed to protect those patterns 
that are sensitive with respect to the privacy of 
individuals. In what follows, we briefly present 
some future research directions both with respect 
to the field of privacy-aware mobility data mining 
and to privacy preserving data mining at large. 


Privacy-Aware Mobility Data Mining 


Spatiotemporal datasets present a new challenge 
to the privacy preserving data mining community 
due to their spatial and temporal characteristics. 
Few approaches have been proposed so far that
manage to address some of the special requirements
of this type of data. A basic drawback of
the existing methodologies is that they fail to 
treat space and time equally well. Instead, most 
of the approaches that have been proposed put 
their effort on the adequate treatment of either 
the spatial or the temporal dimension of the data, 
but not both. As a result, user mobility data is 
often transformed into sequential data, where the 
spatial component is reduced to a set of places of 
interest (events) and the time component (apart
from providing the total ordering of these events 
in the sequence) is disregarded. Thus, we feel 
that there is plenty of room for research in this 
interesting and challenging area. 

As presented earlier, privacy preserving data 
mining in the context of mobility data has been investigated
along three broad research directions:
(a) data perturbation approaches, (b) distributed
(SMC) approaches, and (c) knowledge hiding
approaches. Based on the number of published 
works per direction, it becomes evident that most 
of the research effort has been placed towards the 
development of data perturbation methodologies, 
while few approaches have been devised to sup- 
port the other two directions. We believe that in 
the upcoming years, data mining researchers will 
put more effort in devising novel algorithms for 
the hiding of user mobility patterns, especially 
due to the urgent need for these methodologies
in various application contexts. The hiding of 
sensitive mobility patterns imposes far greater 
challenges than traditional knowledge hiding, 
since specially crafted algorithms are necessary 
to identify all the important correlations that exist 
within the datasets. 

Furthermore, the mining of sensitive knowl- 
edge, depicted in the form of associations in mo- 
bility datasets, may allow for the use of different 
measures of pattern interestingness than the com- 
monly employed support and confidence metrics. 
As a result, new knowledge hiding techniques
may have to be investigated that will successfully 
conceal this novel type of sensitive knowledge. 


Privacy Preserving Data Mining 


As mentioned earlier, privacy preserving data mining
is a rapidly evolving area with a tremendous
number of applications and with many opportunities
for research. A recent trend in this area is the
exploration of new data types and novel domains of
potential knowledge. The privacy-aware mining of
mobility data is only one among the hot research
directions. Another example of a research domain
that is expected to receive a lot of attention in the
upcoming years is the offering of privacy in the
context of applications where data is released
incrementally and at an unconditional rate.

In this challenging area of research, privacy 
preserving data mining methodologies have to 
be designed to handle streams of data rather than 
datasets containing historical recordings. Finally, 
apart from domain-driven research, such as that
presented in the above-mentioned examples, there
is currently an urgent need for the development
of frameworks that will unify more advanced 
measures for the evaluation and the compari- 
son of different privacy preserving data mining 
methodologies. 


CONCLUSION 


In this chapter, we presented an overview of 
privacy preserving data mining, one of the most 
popular directions in the data mining research 
community. In the first part of the chapter, we 
presented approaches that have been proposed 
for the protection of either the sensitive data it- 
self in the course of data mining or the sensitive 
data mining results, in the context of traditional 
(relational) datasets. Following that, in the sec- 
ond part of the chapter, we focused our attention 
on one of the most recent as well as prominent 
directions in privacy preserving data mining: the 
mining of user mobility data. Although still in its 
infancy, privacy preserving data mining of mobil- 
ity data has attracted a lot of research attention 
and already counts a number of methodologies 
both with respect to sensitive data protection and 
to sensitive knowledge hiding. Finally, at the end
of the chapter, we outlined a roadmap for
the field of privacy preserving mobility data mining
as well as the area of privacy preserving data
mining at large.


KEY TERMS AND DEFINITIONS


Privacy Preserving Data Mining: The area 
of data mining that is concerned with privacy 
issues related to the course of data mining, and 
specifically (a) with the protection of privacy in
data releases, (b) with the preservation of privacy in
the mutual mining of data among a set of col- 
laborating parties, and (c) with the protection of 
sensitive knowledge patterns that can be derived 
due to the application of data mining tools. Pri- 
vacy preserving data mining is one of the most 
challenging research areas within the data mining 
community, already counting numerous confer- 
ences, workshops, and journals. 

Sanitization: The process of transforming the 
original dataset to its privacy-aware counterpart 
that can be safely released as it protects the sensi- 
tive data or shields the sensitive knowledge, from 
unauthorized exposure. 

Rule Hiding Approaches: A category of 
methodologies that aim at protecting the sensitive 
knowledge that can be mined from a dataset in 
the form of sensitive association or classification 
rules. The rule hiding approaches primarily oper- 
ate by sanitizing the original dataset such that the 
significance of the sensitive rules deteriorates in 
its sanitized counterpart to such an extent that 
they are no longer mined by the employed rule 
mining strategy. 

Perturbation Techniques: A category of data 
modification approaches that protect the sensitive 
data contained in a dataset by modifying a carefully
selected portion of the attribute-value pairs
of its transactions. The employed modification
renders the released values inaccurate, thus
protecting the sensitive data, while also
preserving the statistical properties of the dataset
(e.g. the first and second order statistics) such 
that its mining yields accurate results. 

Reconstruction Approaches: A category of 
sensitive knowledge hiding approaches that operate
by generating a new dataset based on a portion
of the transactions of the original dataset (i.e. 
those transactions that support the non-sensitive 
knowledge). This category of approaches has been 
studied in the context of classification rule hid- 
ing. The transactions of the original dataset that 
support the non-sensitive rules are used to build a 
classification model from which the transactions 
of the sanitized dataset are generated. 

Secure Multiparty Computation: A research 
direction within the area of privacy preserving 
methodologies for the protection of sensitive data. 
Secure multiparty computation collects distributed 
privacy preserving methodologies that enable a 
number of collaborating peers to collectively mine 
their data without having to reveal their datasets to 
each other. The approaches of this category operate 
by employing a family of protocols which allow 
the peers to exchange data in a secure manner. The 
security of the protocols is achieved through the 
application of cryptographic approaches which 
enable the secure computation of a function based
on distributed inputs. 

Privacy-Aware Mobility Data Mining: The 
field of privacy preserving data mining that collects 
methodologies that offer privacy in the mining of 
data related to user mobility. The methodologies 
of this category can be classified along three 
directions: (a) methodologies that protect the 
sensitive data that relates to user mobility prior 
to the course of data mining; (b) methodologies 
which enable a number of collaborating parties 
to collectively mine their data in a privacy-aware 
manner, and (c) knowledge hiding methodologies 
which conceal sensitive knowledge patterns that 
summarize user mobility from being identified in 
the course of mining the dataset. 


Chapter 8 


Data Mining Challenges in the 
Context of Data Retention 


Konrad Stark 
University of Vienna, Austria 


Michael Ilger 
Vienna University of Technology & University of Vienna, Austria 


Wilfried N. Gansterer 
University of Vienna, Austria 


ABSTRACT 


Retaining electronic communication and internet traffic data imposes novel technical and organisa- 
tional challenges for internet service providers as well as for government authorities. ISP companies 
are not only burdened by storing extraordinary amounts of data, but also must develop and adhere to 
data protection and data security policies in order to protect the data against unauthorised access or 
disclosure and against accidental destruction. The authors present a distributed, horizontally partitioned
data warehouse architecture for retaining data at each internet service provider separately. Moreover,
they elaborate a data warehouse schema for storing e-mail data according to the European data retention
directive which facilitates parameterised data retrieval. The authors show how their system allows
for applying various types of data mining techniques to both internet access and communication data.
Finally, they discuss issues related to data security, cost and performance, and reveal limitations of 
data retention systems. 


INTRODUCTION 


The EU Data Retention directive 2006/24/EC (“Data 
Retention Directive”) of the European Parliament, 
published on 15.03.2006, requires the operators 
of publicly accessible electronic communication 
networks to store (“retain”) certain data which is 
generated or processed in their networks to serve
the investigation, detection, and prosecution of 
serious crime (European Parliament, 2007). Na- 
tional service providers are required to implement 
and maintain the technical means needed to store 
and provide this data to government authorities 
upon request. For various categories of electronic 
communication, including Internet access, Internet 
e-mail and Internet telephony, the directive defines 
which data has to be retained. Affected are traf-
fic and location data (but not the contents of the 
communication) for a period of time between six 
months and two years. 

In this chapter, we discuss the application of 
data mining methods in the context of implementing
this EU Directive, which has implications
for both the public and private sectors. Retaining
electronic communication and internet traffic 
data imposes novel technical and organisational 
challenges for internet service providers as well as 
for government authorities. These challenges not 
only relate to the collection and the management 
of the data to be retained, but also to the analysis
of the data, for example, when having to respond 
to queries posted by government authorities. 

Challenges for the ISP company: A data 
retention system has to respond to enquiries of 
competent authorities ‘without undue delay’
(Elizalde, 2006). That is, instead of storing e-mail
communication in log files, data has to be
stored in a structured way facilitating efficient
data retrieval. Although the directive specifies
the mandatory information to be stored for each
e-mail communication, no technical guidelines are
given about how the information may be stored
to support parameterised queries. 

Challenges for the government authorities: The 
retained data is distributed among various ISPs,
which particularly complicates the analysis of
e-mail data. For instance, if an e-mail is sent from
person A to person B, who are customers of two
different providers PV1 and PV2, two separate
enquiries are necessary to identify both individuals.
If the common social relationships of A and
B are surveyed, two result sets are delivered by
PV1 and PV2. In order to combine the result sets
and perform analyses, standardised data structures 
are essential. 

From the legal point of view, the ISP company 
is the owner of the customer data and responsible 
for it. The company must not retain e-mail data 
externally. Hence, a central data retention system 
hosted by authorities is not allowed. Further, for 
competitive reasons companies are usually not
interested in storing valuable customer-related data
outside their control. An authority may formulate
an enquiry for a person as a result of a court
order. In this case ISPs have to deliver the
communication data for the person in a timely manner. Thus,
a data retention system is needed allowing dis- 
tributed, standardized and protected data storage 
for ISPs and a secure central enquiry interface 
for the authority. Therefore, we encourage using 
a distributed data warehouse system to meet all 
these requirements. 


Definitions 


‘A data warehouse is a subject oriented, integrated,
non-volatile and time variant collection of data in
support of management decisions’ (Kimbal 1996).


Data warehouses (DWH) are designed to facilitate
so-called Online Analytic Processing (OLAP)
of data. That is, data is analysed interactively based
on hypotheses. DWH data is stored in proprietary 
schemas that are optimised for data analysis. 

In the following, we propose a data retention 
system with standardised data structures, query 
interfaces, data linkage and data analysis tools. 
We elaborate a distributed data warehouse that is 
composed of local data warehouse nodes residing 
at the ISP companies. We design a data warehouse
schema for retaining e-mail and internet access 
data which satisfies the following requirements: 


• store mandatory data according to the EU
data retention directive,

• support person, time, and location-related
enquiries by appropriate dimensions, and

• store additional information useful for data
analysis.


RELATED WORK


Over the past years information technology
has faced novel demands from data retention requirements
(Stampfel et al., 2008; Stampfel, G., &
Gansterer, W. N., & Ilger, M., 2008). Criminals
may utilize modern means of communication 
like phone calls, e-mail, instant messaging, or 
data sharing to sketch, plan and coordinate their 
activities. Hence, within the vast amount of data 
available electronically at phone companies and 
internet providers, relevant information about 
interpersonal relationships and ongoing prepara- 
tion work could be contained. On the one hand, 
legal and ethical questions concerning violated 
and deprived civil rights have been raised. Since 
both communication and internet access data is 
highly sensitive, privacy-preserving measures are 
required. Kotzanikolaou (2008) proposes a public-key
infrastructure that allows encrypted data to be
transferred between internet service providers and
judicial and law enforcement authorities.

On the other hand, efficient and precise data 
analysis techniques are needed. Much work has 
been done in the context of social network analysis
(Menon & Hicks & Larson, 2007; Diesner &
Carley, 2005), which strives to reveal social relationships
(e.g. communication acts) between individuals.
Recently, the National Research Council
of the United States released a detailed report 
on the application of data analysis techniques in 
counterterrorism activities of the United States 
(Perry, 2008). Besides describing the limitations 
of current data analysis techniques, the report rec- 
ommends a periodic evaluation strategy for these 
techniques assessing the effectiveness of ongoing 
programs, evaluating technological advances and 
ensuring consistency with US Laws. 

Various commercial products supporting data
retention have been presented recently. Some of
them (Teradata Corporation, 2006; DATAllegro,
2008) are based on data warehouse appliances.
Others provide server infrastructure together with
a data appliance in one package (Sun Microsystems,
2007). These solutions primarily focus on efficient
data storage and data integrity issues. However, no
standardised storage schema that allows queries
independent of the underlying system has been
defined up to now. For instance, one solution
(Hewlett-Packard, 2007) claims to be designed to
facilitate integration with law enforcement agency
reporting systems. That is, appropriate interfaces
to answer enquiries of authorities would still have to
be implemented. When considering a set of ISP
companies, each company may decide to use a
different data retention system with proprietary
data retrieval and export tools. Therefore, we
suggest a standardised data schema allowing
well-defined queries and standardised exports to
facilitate data analyses.

In the following sections we provide a short 
overview of a possible database implementa- 
tion which can be used to retain the stored data. 
Based on the proposed implementation we show 
how data analysis and queries could work in this 
situation. Finally we also put a focus on some 
important limiting factors such as cost aspects 
related to the creation of a data warehouse, or 
the security requirements that have to be met to 
prevent unauthorized access. 


DATA STORAGE 


Within this section, appropriate data structures 
for storage and retrieval of e-mail and internet 
access data are designed. In theory there are 
many different approaches to store the data. The 
extreme cases would be to use one single database 
to store all the data, or to have completely differ- 
ent solutions deployed at each ISP. We propose a 
distributed data warehouse architecture allowing 
for retaining sensitive customer data inside ISPs.
Nevertheless, analyses spanning several ISPs are 
facilitated by using an identical data warehouse 
schema in each company and by defining appro- 
priate linkage attributes that may be used to join 
records across company borders. 


Background: Data Warehousing 


For building a data warehouse (DWH), relevant data is
extracted from various data sources of operational business
units, transformed, and loaded into the specialised
schema. A DWH schema allows viewing data
from a multidimensional perspective (Bauer &
Guenzel, 2004). That is, data may be viewed from
a multitude of views depending on the type and
complexity of data analysis requirements. We distinguish
entities that are in the centre of analyses
from those providing supplementary information
for a particular view on the analysed data.

Figure 1. Storage schema for e-mail data

Entities of the former category are called 
facts while the latter ones are called dimensions. 
Dimensions of a data warehouse schema are used 
to aggregate fact table entries and to retrieve sum- 
marised data of a subset of interest. Moreover, 
dimensions may be used to model various granu- 
larities of perspectives. Each dimension may be 
considered as a hierarchical classification allow- 
ing access to different levels of granularity. The 
multidimensional view of data is also considered 
as a data cube spanned by dimensions and filled 
by the fact values. A data cube consists of data
cells, where every data cell stores a fact value for 
a certain combination of dimension members. 
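
The roll-up and selection operations described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the cube is a plain dictionary, and the dimension members (dates, account names) and fact values are invented for the example.

```python
from collections import defaultdict

# Hypothetical fact cells: (day, sender account) -> number of messages.
# All dimension members and values are invented for illustration.
cells = {
    ("2008-03-01", "alice@example.org"): 4,
    ("2008-03-02", "alice@example.org"): 1,
    ("2008-03-02", "bob@example.org"): 3,
    ("2008-04-10", "bob@example.org"): 2,
}


def roll_up_to_month(cube):
    """Aggregate day-level cells to the coarser month level of the time dimension."""
    result = defaultdict(int)
    for (day, account), value in cube.items():
        month = day[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        result[(month, account)] += value
    return dict(result)


def slice_by_account(cube, account):
    """Select the sub-cube belonging to one member of the account dimension."""
    return {key: value for key, value in cube.items() if key[1] == account}


monthly = roll_up_to_month(cells)
alice_only = slice_by_account(monthly, "alice@example.org")
```

Here `monthly` holds one cell per (month, account) combination, which corresponds to moving one level up in the time hierarchy.
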

Email Data Storage Schema 


In Figure 1 an adapted star schema illustrates how
e-mail data may be stored. In the centre of the
schema the fact table Message stores records of
e-mail traffic. The fact table is connected with five
dimension tables; the tables Mail_Account
and IP_Address are each connected with the fact
table twice. In the following, the dimension tables
are described in detail:


• Time: The time dimension stores the date
  the message was sent or received. Columns
  Day, Week, Month, Quarter, and Year are
  used to store the date in different granularities,
  supporting selection, roll-up and drill-down
  operations.

• IP Address: This table is used twice: once
  for storing the IP address of the sender and
  once for storing the address of the recipient.
  However, this information is only available
  if one or both (sender and receiver)
  are customers of the same ISP. The idea
  behind separating the relations
  (IP_Address, Person) and (Mail_Account,
  Person) is that these persons do not have
  to be identical. Consider a person using his
  e-mail account from an internet café. In
  this case the relation (IP_Address, Person)
  gives information about the current location
  (internet café) and the relation (Mail_Account,
  Person) gives information about
  the identity of the person. IP addresses are
  linked with some geographical information
  according to the geographical location
  of the ISP. That information may be useful
  for location-based searches, for instance,
  for finding communication relationships
  of a certain person in a specific city (see
  Figure 1).


Note that this model allows the specification
of three different types of location: the person
sending or receiving the message has an associated
location, the person granting Internet access to the
sender or receiver may be assigned a location, and
finally the ISP has an associated location.

Like the IP_Address table, Mail_Account is
used as a dimension table twice, for storing e-mail
information of the sender and of the receiver. E-mail
accounts may be linked with entries of the Person
table if the information is available. This is only
the case for customers of the ISP. However, the
identity of a person may be detected if records
of multiple ISPs are combined for an analysis.

Incoming and outgoing e-mail messages are 
stored in the same fact table. However, one e-mail 
may lead to several entries in the Message table, 
as one e-mail may be sent to multiple recipients. 
More precisely, one entry in table Message stores 
a communication act between one sender and 
one receiver. 
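
To make the star schema more concrete, the following sketch creates a reduced version of it in an in-memory SQLite database. It is an assumption-laden illustration rather than the schema of Figure 1: only the tables named in the text are included, and all column names (for example `sender_account_id`) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time (time_id INTEGER PRIMARY KEY, day TEXT, week INTEGER,
                   month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE Person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IP_Address (ip_id INTEGER PRIMARY KEY, address TEXT,
                         person_id INTEGER REFERENCES Person(person_id));
CREATE TABLE Mail_Account (account_id INTEGER PRIMARY KEY, address TEXT,
                           person_id INTEGER REFERENCES Person(person_id));
-- Fact table: one row per (sender, receiver) communication act.
-- Mail_Account and IP_Address are each referenced twice.
CREATE TABLE Message (
    message_id INTEGER PRIMARY KEY,
    time_id INTEGER REFERENCES Time(time_id),
    sender_account_id INTEGER REFERENCES Mail_Account(account_id),
    receiver_account_id INTEGER REFERENCES Mail_Account(account_id),
    sender_ip_id INTEGER REFERENCES IP_Address(ip_id),
    receiver_ip_id INTEGER REFERENCES IP_Address(ip_id)
);
""")

# Invented sample data: one e-mail to two recipients yields two fact rows.
conn.executemany("INSERT INTO Time VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "2008-03-01", 9, 3, 1, 2008),
    (2, "2008-04-10", 15, 4, 2, 2008),
])
conn.executemany("INSERT INTO Mail_Account VALUES (?, ?, ?)", [
    (1, "a@isp.example", None),
    (2, "b@isp.example", None),
    (3, "c@other.example", None),
])
conn.executemany("INSERT INTO Message VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 2, None, None),
    (2, 1, 1, 3, None, None),
    (3, 2, 2, 1, None, None),
])

# Roll-up along the time dimension: communication acts per month.
per_month = conn.execute("""
    SELECT t.year, t.month, COUNT(*)
    FROM Message m JOIN Time t ON m.time_id = t.time_id
    GROUP BY t.year, t.month
    ORDER BY t.year, t.month
""").fetchall()
```

The IP address references are left empty (NULL) in the sample rows, mirroring the remark that this information is only available for customers of the ISP.
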


Message Storing 


An e-mail is sent from one sender (From) to one
or more immediate receivers (To). Optionally, one
or more persons receive the message as carbon
copy (CC) or as blind carbon copy (BCC). For
incoming e-mail messages the BCC information
is only available for the BCC receiver. For each
(From → To) relation a new record is inserted
into table Message. Further, each (From → CC)
relation is entered in the same way. If the Internet
access information of sender and/or receiver is
available, it is stored by linking the message with
the corresponding IP_Address entries. Attribute
Type specifies the type of the receiver, which may
be To, CC or BCC. The message types may be
used to weight the communication acts.

For instance, a (sender → TO → receiver) relation is
a more immediate communication act than a
(sender → CC → receiver) relation. The type information
may be used in social network analysis (Diesner
& Carley, 2005; Pathak et al., 2006). We propose
to use numerical values to weight the different
types, since numerical values may be aggregated
in typical OLAP operations. For example, if the
top ten communication partners of a certain person
have to be determined, an algorithm could
calculate a ranking measure from the frequency
and weights of communication acts. The Direction
attribute defines whether the message is unidirectional
or bidirectional. That is, each message containing
an In-Reply-To entry in its message header is the
result of a bidirectional communication.
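
A ranking measure of this kind might look as follows. The sketch is hedged in two respects: the concrete weights in `TYPE_WEIGHT` are assumptions (the text proposes numerical weights but does not fix values), and the message records are invented.

```python
from collections import defaultdict

# Hypothetical numerical weights per receiver type.
TYPE_WEIGHT = {"TO": 1.0, "CC": 0.5, "BCC": 0.5}

# Invented Message records: (sender, receiver, receiver type).
messages = [
    ("paul", "simon", "TO"),
    ("paul", "simon", "TO"),
    ("paul", "patrick", "TO"),
    ("paul", "wolf", "CC"),
    ("paul", "wolf", "CC"),
    ("paul", "wolf", "CC"),
    ("simon", "paul", "TO"),
]


def top_partners(records, person, n=10):
    """Rank the communication partners of `person` by summed type weights."""
    scores = defaultdict(float)
    for sender, receiver, receiver_type in records:
        if sender == person:
            scores[receiver] += TYPE_WEIGHT[receiver_type]
        elif receiver == person:
            scores[sender] += TYPE_WEIGHT[receiver_type]
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]


ranking = top_partners(messages, "paul")
```

With these weights, three CC messages count less than two direct messages, so frequency alone does not determine the rank.
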

Further, the direction type could be a simple
boolean attribute distinguishing between uni- and
bidirectional messages. Attribute Multiplicity specifies
the number of addressees of the message. The
more addressees receive the message, the lower
the degree of privacy of the message. The multiplicity
information may be used to filter out private
one-to-one communication acts or to weight
messages, ranking private messages high and
more public messages low, for instance
mailing list messages. Outgoing e-mail messages
are stored in the same way as incoming messages.
The communication partners are extracted from
the e-mail message envelopes and corresponding
entries are created for all (From → To), (From →
CC) and (From → BCC) relations.
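
The expansion of one e-mail into several Message records can be sketched with Python's standard `email` package. As a simplification, the example reads the addressees from the message headers rather than from the transport envelope, and the record layout (a dictionary with the keys `sender`, `receiver`, `type`, `bidirectional`, `multiplicity`) is an assumption made for illustration.

```python
from email import message_from_string
from email.utils import getaddresses

# Invented sample message.
RAW = """\
From: paul@isp.example
To: simon@isp.example, patrick@other.example
Cc: wolf@other.example
In-Reply-To: <1234@isp.example>
Subject: meeting

See you on Monday.
"""


def to_message_records(raw):
    """Create one record per (From -> receiver) relation of an e-mail."""
    msg = message_from_string(raw)
    sender = getaddresses(msg.get_all("From", []))[0][1]
    receivers = []
    for header, receiver_type in (("To", "TO"), ("Cc", "CC"), ("Bcc", "BCC")):
        for _name, address in getaddresses(msg.get_all(header, [])):
            receivers.append((address, receiver_type))
    # A reply header marks the message as part of a bidirectional exchange.
    bidirectional = msg.get("In-Reply-To") is not None
    multiplicity = len(receivers)  # number of addressees
    return [
        {"sender": sender, "receiver": address, "type": receiver_type,
         "bidirectional": bidirectional, "multiplicity": multiplicity}
        for address, receiver_type in receivers
    ]


records = to_message_records(RAW)
```

Each of the three resulting records carries the same multiplicity, so later aggregations can weight or filter them without looking at the original e-mail again.
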


Data Mining Challenges in the Context of Data Retention 


Figure 2. Storage schema for internet access data 

Internet Access Storage Schema 


For storing internet access data a similar data
warehouse schema may be designed. The fact table
Internet_Access stores each visit of a website.
That is, each fact corresponds to one website access
of a person at a certain time. The fact table is
connected to three dimension tables, of which the
table IP_Address is used twice. First, the IP
address of the person visiting a certain website is
recorded (Source_IP_ID), and second, the IP address
of the visited website is stored. If the source
IP entries can be related to single persons, they
are linked to corresponding Person table entries.
Otherwise, they are linked to the location of
the internet access point (e.g. WLAN hotspot at
the airport, internet café) (see Figure 2).

Usually, the access IP entries may be mapped
to URLs, which are stored in a separate attribute.
In order to support elaborate queries, a semantic
categorization of websites may be helpful. For
example, if the focus of an analysis lies on flight
booking, only visits to airline websites, airport
websites and flight search engines should be
selected. Semantic categories ease the extraction
of meaningful information. Although it is a laborious
task to create and maintain semantic annotations
of websites, the search capabilities are
increased significantly.

As semantic categories may be defined for
various levels of granularity, we propose to
allow a hierarchical classification of categories.
A Semantic_Category may be a subcategory of another
Semantic_Category. For instance, websites
of airline and train companies can be generalized
to transport websites. Furthermore, as websites
may cover several interests or provide a broad
range of services, they may be assigned to several
categories. Therefore, there is an n:m relationship
between IP_Address and Semantic_Category.
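
The hierarchical classification and the n:m assignment can be illustrated with two small mappings. The category names follow the transport example in the text; the IP addresses and the exact data layout are assumptions of this sketch.

```python
# Hypothetical category hierarchy: category -> parent category (None = top level).
PARENT = {
    "airline": "transport",
    "train": "transport",
    "transport": None,
    "gambling": None,
}

# n:m assignment: a website (identified by its IP address) may carry
# several categories, and a category may cover several websites.
SITE_CATEGORIES = {
    "192.0.2.10": {"airline"},
    "192.0.2.20": {"airline", "train"},  # e.g. a travel portal
    "192.0.2.30": {"gambling"},
}


def ancestors_and_self(category):
    """Yield the category followed by all of its super-categories."""
    while category is not None:
        yield category
        category = PARENT[category]


def sites_in_category(category):
    """Return all sites assigned to the category or to one of its subcategories."""
    return sorted(
        site
        for site, categories in SITE_CATEGORIES.items()
        if any(category in ancestors_and_self(c) for c in categories)
    )
```

A query for the general category `transport` then also returns sites that were only annotated with the more specific categories `airline` or `train`.
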

The schema supports a multitude
of query types. If the focus lies on certain interests
or topics, groups of related persons can be
identified. On the other hand, if the focus lies on
finding common interests or activities of a group
of persons, the intersection of the websites visited
by all persons of the group can be determined
and common interests may be derived. Further,
time and location parameters allow narrowing
the result sets.

Figure 3. Distributed data warehouse architecture for data retention

Consider the following example:
three persons have been cheated by an unknown
criminal trader. Each of them was promised a
lucrative share of profits for investing
several thousand Euros in property funds.
After a personal meeting and the handing-over of
money, the trader broke off contact. All persons
were contacted by email within the same month.
As the emails included some personal details
(correct profession, place of residence), the trader
is assumed to have selected his victims after a detailed
investigation. Hence, a comparative analysis of
the internet activities of the cheated persons could
shed light on the investigation activities of the
trader. After querying the relevant internet access
records of the last month, the common interests are
identified. All three persons frequently visited an
online gambling casino and exposed some
personal details in associated discussion forums.
After investigation of all other participants of
the discussion forum and confrontation of the
suspects with the victims, the criminal trader
is identified.
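
The comparative analysis in this example boils down to a set intersection. The following sketch assumes that the access records have already been reduced to one set of visited sites per person; the person labels and site names are invented.

```python
# Hypothetical internet access records: person -> set of visited sites.
visits = {
    "victim_a": {"news.example", "casino.example", "bank.example"},
    "victim_b": {"casino.example", "shop.example"},
    "victim_c": {"casino.example", "news.example", "mail.example"},
}


def common_sites(access, persons):
    """Sites visited by every person of the given group."""
    sets = [access[p] for p in persons]
    return set.intersection(*sets) if sets else set()


def persons_visiting(access, site):
    """Reverse query: all persons who visited a given site."""
    return sorted(p for p, sites in access.items() if site in sites)


shared = common_sites(visits, ["victim_a", "victim_b", "victim_c"])
```

The reverse query `persons_visiting` corresponds to the first query type mentioned above, where a topic of interest is given and the related persons are sought.
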


Distributed Architecture 


Similar to distributed databases, data warehouses
may be distributed over various physically separated
systems. Much emphasis has been put on
distributed architectures (Bellatreche et al., 1999;
Noaman & Barker, 1999; Wehrle et al., 2005).
Two different types of partitioning may be distinguished:
horizontal and vertical partitioning.
In a vertical partitioning, relations are distributed
by separating attributes. In a horizontal partitioning,
the logical schema remains the same while
records are distributed among separate systems.
In the context of data retention, we propose a
horizontally partitioned data warehouse schema
that is distributed over all ISP companies. Thus,
sensitive customer-related data is recorded at each
ISP and remains on the company’s own storage
devices.

All ISPs use the same data schema for storing
data of the customers and constantly record e-mail
and internet traffic. Authorities may be
granted access by exporting and transporting relevant
records securely (e.g., by SSH File Transfer Protocol)
to analysis software. An illustration of the
architecture is given in Figure 3. Each ISP stores its
e-mail and internet access data separately, as shown
in the bottom layer. Data is extracted from log
files and written continuously into the predefined
relational structures. The fact-dimension schemata
illustrated in Figure 1 and Figure 2 are identical
for every ISP in order to allow queries and data
retrieval across company borders. An integrated
view is used to link the distributed data sources
together. That is, at least the dimension Person
has to be available in a global data repository in
order to initiate enquiries.

Each data analysis request is defined on the 
basis of the global data repository. A set of persons 
of interest is selected together with parameters 
for time and location. Then, the data retrieval 
requests are sent to each data source. Based on 
the search criteria, all matching entries are filtered 
out of the local data and returned to the global 
requester. The result sets may be easily combined 
because of the identical data structures. The result 
set is searched for redundant data, and duplicate 
message entries are eliminated. Finally, data 
analysis tools — for instance social network 
analysis tools — may be applied to the retrieved 
data. 
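
The request flow described above (local filtering at each ISP, combination at the global requester, elimination of duplicates) can be sketched as follows. The record layout and the globally unique `message_id` used to recognise duplicates are assumptions of this illustration; the text does not specify how duplicates are detected.

```python
from datetime import date

# Hypothetical local stores of two ISPs with an identical record layout:
# (message_id, sender, receiver, sent_on).
ISP_A = [
    ("m1", "paul", "simon", date(2008, 3, 1)),
    ("m2", "paul", "wolf", date(2008, 1, 15)),
]
ISP_B = [
    ("m1", "paul", "simon", date(2008, 3, 1)),  # same message, seen at the receiver's ISP
    ("m3", "jim", "mario", date(2008, 3, 2)),
    ("m4", "simon", "paul", date(2008, 3, 3)),
]


def local_query(store, persons, start, end):
    """Executed at one ISP: filter the local records by person and time period."""
    return [
        record for record in store
        if (record[1] in persons or record[2] in persons)
        and start <= record[3] <= end
    ]


def global_query(stores, persons, start, end):
    """Executed by the global requester: combine partial results, drop duplicates."""
    combined = []
    seen = set()
    for store in stores:
        for record in local_query(store, persons, start, end):
            if record not in seen:
                seen.add(record)
                combined.append(record)
    return combined


result = global_query([ISP_A, ISP_B], {"paul"}, date(2008, 3, 1), date(2008, 3, 31))
```

Because both stores use the same record structure, the partial results can be concatenated without any transformation step.
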


DATA ANALYSIS 


Various data analysis techniques exist that may
be applied to internet and e-mail data. Generally,
these techniques may be assigned to two
different categories: subject-based queries and
pattern-based queries (DeRosa, 2004). The aim
of a subject-based query is to reveal relationships,
activities or data of a known subject. A
suspected person is observed, including his or her social
relationships, financial transactions and planned
activities. For example, by querying the credit
card transactions, an arranged journey may be
identified. Typically, a subject-based query
is answered by searching for data records of an
individual in distributed data sources and linking
the results. This technique is also referred to as
linkage analysis.

Generally, subject-based data mining is likely
to enhance traditional police investigations. It
allows accessing and analysing large amounts of
data more conveniently. If one or a few individuals
of a communication network are known, entire
communication networks can more easily be
identified (Perry, 2008). In order to allow correct
and exhaustive queries, data records must
be unambiguously mapped to individuals, and
linkage over various data sources must be supported.
In the context of data retention, standardised
query interfaces are required that allow person-centred
queries over email and internet access
data. Further, queries spanning both kinds of data
are possible. For instance, assume a person is
in email contact with a group of people. Then
common interests of the group may be deduced
by intersecting the web access paths
of all group members.

DeRosa (2004) specifies data mining as “a 
process that uses algorithms to discover predic- 
tive patterns in data sets”. That is, the result of a 
data mining process is a set of rules, or patterns, 
or associations revealing knowledge that was 
previously unknown. While statistical analysis 
is used to verify or reject predefined hypotheses, 
data mining algorithms are deployed to generate 
hypotheses from available data. Data mining may
be seen as one processing step in the process of
knowledge discovery. After the patterns or relationships
are discovered, they are evaluated and
transformed into some kind of knowledge representation
(Seifert, 2007). One of the most important
areas of data mining is market basket analysis.
This analysis focuses on items that are likely to
be bought together. A set of association rules
reflecting the shopping habits of the consumers
is created. For instance, people buying pasta also
tend to buy red wine.
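
The strength of such an association rule is commonly expressed by its support and confidence. The following sketch computes both for the rule "pasta implies red wine" on a handful of invented shopping baskets; it illustrates the measures only and is not a full rule mining algorithm.

```python
# Invented shopping baskets.
baskets = [
    {"pasta", "red wine", "tomatoes"},
    {"pasta", "red wine"},
    {"pasta", "bread"},
    {"bread", "milk"},
    {"pasta", "red wine", "milk"},
]


def count(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


rule_support = support({"pasta", "red wine"}, baskets)
rule_confidence = confidence({"pasta"}, {"red wine"}, baskets)
```

In this toy data set three of the five baskets contain both items, and three of the four pasta baskets also contain red wine.
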

However, the derived patterns do not tell the
user anything about the significance or value of
the discovered patterns or relationships (Seifert,
2007). Therefore, the informative value of deduced
knowledge should not be overestimated.
As described in (Baard, 2002), a federal agency
of the United States tried to identify patterns that
distinguish the 9/11 hijackers from the rest of the
population. By applying data mining techniques to
various public and private (e.g. credit card companies)
databases, distinctive patterns were to be
found. Among some other factors, a high frequency
of pizza orders paid by credit card was detected.
Apart from the questionable relevance of such
patterns, there are some other issues concerning
data mining that have to be considered.

Taipale (2004) argued that commercial data
mining typically operates on homogeneous data
sets. Rules or patterns are derived from one type
of data. For instance, consumer buying habits
are derived solely from consumers’ transactions.
Further, the analysis of the navigation behaviour
of web users is accomplished by evaluating web
logs of a single site. In contrast, domestic security
applications operate on heterogeneous data
sets (people, locations, activities). Thus, established
data mining techniques are not applicable to that
kind of data without modifications. Another
problem is the lack of sufficient records that may
be used to recognise patterns. Data records of
criminal or terrorist activities are rare compared
to the vast amount of data that is available for the
whole population. As the results of data analysis
techniques strongly depend on the quality of data,
data records have to be examined carefully. If the
internet access data cannot be related to a single
person, because the internet access is shared between
two persons, both the data records and the
derived patterns are biased.


False Positives and False Negatives 


Patterns and rules discovered by data mining
techniques are always subject to minor inaccuracies.
For instance, a confidence of 98% of a rule
specifies that the rule is correct in 98 percent of
the analysed data records, but it does not apply in
the remaining 2%. Let us assume that a rule is used
for detecting credit card fraud. If the rule
applies to the credit card transactions of an honest
consumer, he is wrongly assumed to be a criminal.
In this case the rule yields a false positive.
Apart from the inconvenience for the persons concerned,
false positive classifications may impose considerable costs
on national security because of the employed
personnel and technical resources (Tapac, 2004).
In order to reduce the false positive rate, Jensen et
al. (2003) proposed to analyse records of groups
simultaneously instead of concentrating on single
persons. Since the terrorist threat scenarios are
based on cooperating individuals, this approach
allows concentrating on common interests and
communications and may decrease the number
of false positives.

On the other hand, a rule or pattern may not
cover all relevant cases. For instance, the financial
transactions of a credit card fraud may appear unsuspicious
and not be covered by any detection rule; such
cases are false negatives. The
numbers of false positives and false negatives are
inversely related. That is, if only the most reliable
rules and patterns are selected, the number of
false positives will decrease. However, the false
negative rate will increase, as fewer relevant cases
are discovered. Using less reliable rules has the
opposite effect. Hence, the evaluation and selection
of appropriate rules is a challenging task.
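
The trade-off can be made tangible with a small numerical sketch. It assumes, purely for illustration, that each case has been assigned a score reflecting the reliability of the rule that flagged it, and that only cases scoring at or above a threshold are reported; the scores and labels are invented.

```python
# Invented scored cases: (score of the matching rule, is the case really fraudulent?).
cases = [
    (0.99, True), (0.95, True), (0.90, False), (0.85, True),
    (0.70, False), (0.60, True), (0.40, False), (0.20, False),
]


def errors(scored, threshold):
    """Return (false positives, false negatives) when flagging scores >= threshold."""
    false_pos = sum(1 for score, fraud in scored if score >= threshold and not fraud)
    false_neg = sum(1 for score, fraud in scored if score < threshold and fraud)
    return false_pos, false_neg


strict = errors(cases, 0.93)   # only the most reliable rules are used
lenient = errors(cases, 0.50)  # less reliable rules are admitted as well
```

With the strict threshold no honest case is flagged but two fraudulent cases are missed; with the lenient threshold the situation is reversed.
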


Data Analysis Example: 
Inductive Logic Programming 


In the following, we show how inductive logic
programming may be leveraged in a pattern-based
data analysis of email and internet access data.

Inductive logic programming (ILP) is a machine
learning technique for synthesising new
knowledge from experience. It is used for the
construction of first-order clausal theories from
examples and background knowledge (Muggleton,
1994). The main idea is to create a knowledge
model of a certain domain and to use inductive
inference to create hypotheses based on this
model. The knowledge model consists of facts
and rules, which can be specified in a Prolog notation.
Additionally, a set of positive and negative
examples is used to support the learning process.
The result of ILP is a set of induced hypotheses,
which are generalised rules that may be used for
subsequent classifications. The ILP inference
process may be best described by an example.
(We are using the syntax of the ILP system Progol
(Progol, 2007).)

Consider the following scenario: Paul is
proven to be involved in illegal weapon export
businesses. However, the group of accomplices has
not been identified up to now. There are two other
persons, Patrick and Simon, who are suspected
of supporting Paul’s activities. In order to identify
social relationships and potential common interests,
the email communication data and internet
access data are analysed. Based on Paul’s email
data and interests, a set of relevant data records
(Paul’s communication partners and people sharing
Paul’s interests) is retrieved. These records
are transformed into logical facts.


person(paul).
person(patrick).
person(simon).
person(jim).
person(mario).
person(wolf).
person(rudolph).
attends_golf_club(jim).
attends_golf_club(mario).
attends_golf_club(patrick).
attends_golf_club(simon).
communicate(paul,patrick).
communicate(paul,simon).
communicate(simon,paul).
communicate(patrick,paul).
communicate(paul,wolf).
communicate(paul,rudolph).
criminal(paul).


The person facts are type assignments used to
classify e.g. the string ‘paul’ as type person. The
fact communicate(P1,P2) specifies that an email
was sent from person P1 to P2 or vice versa. The fact
attends_golf_club(Pi) means that the person Pi is a
frequent visitor of a certain golf club homepage.

This information was derived by analysing
the internet access data. For simplicity, we do
not include any other facts regarding the internet
access behaviour. However, the fact set may include
many more facts reflecting the interests of persons or
their participation in organisations. Finally,
an artificial fact criminal(paul) is introduced.
This fact is a kind of background knowledge
which should be included in the inference process.
Further, we need positive and negative examples
to tell the ILP system which rules (hypotheses)
are relevant.


suspicious(patrick). 
suspicious(simon). 
:-suspicious(jim). 
:-suspicious(mario). 
:-suspicious(rudolph). 
:-suspicious(wolf). 


Finally, we need to specify the focus of our
analysis. That is, we decide which predicates may
appear in the rule heads and which ones in rule
bodies. In our case, we concentrate on the suspicious
predicate and want to generate rules that
classify suspicious and non-suspicious persons.
The suspicious predicate must be specified as a
head (modeh), while all other predicates are
only allowed to appear in the rule body (modeb).


:- modeh(2,suspicious(+person))? 

:- modeb(1,communicate(-person,+person))? 
:- modeb(1,attends_golf_club(+person))? 

:- modeb(1,criminal(+person))? 


After searching, the ILP system finds the following
generalised rule:


suspicious(A):- attends_golf_club(A), 
communicate(B,A), communicate(A,B), 
criminal(B). 


This rule may be interpreted as: every person who
attends the golf club and has bidirectional
email contact with a criminal person is suspicious. As already
mentioned, this ‘rule’ is a hypothesis that can be used
to distinguish between positive and negative
examples and/or find similarities between the
positive examples. However, sometimes rules
just express tendencies or are only applicable to
a small set of learning data. Therefore, an elaborate
modelling of the knowledge as well as an
evaluation of induced rules are essential tasks.

Let us assume a person ‘bill’ who has so far not been
suspected of being involved in the weapon businesses.
We add some further facts, and query the predicate
suspicious:


person(bill). 
communicate(paul, bill). 
communicate(bill,paul). 
attends_golf_club(bill). 
:- suspicious(X)?


X = patrick; 
X = simon; 
X = bill; 


Due to the new suspicious rule, bill is classi- 
fied as a suspicious person and may be eventually 
related to the criminal weapon circle. Thus, the 
newly generated rule may be reused as new facts 
come up. Moreover, the logical representation of 
relational data allows detecting relationships and 
links that would have been masked in traditional 
database systems. 
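
For readers without a Progol installation, the effect of applying the induced rule can be reproduced with a few lines of Python. The sketch does not perform any induction; it only evaluates the rule over the facts of the example (including the added facts for bill), with sets standing in for the Prolog predicates.

```python
# Facts of the example, including the additional facts for 'bill'.
persons = {"paul", "patrick", "simon", "jim", "mario", "wolf", "rudolph", "bill"}
attends_golf_club = {"jim", "mario", "patrick", "simon", "bill"}
communicate = {
    ("paul", "patrick"), ("paul", "simon"), ("simon", "paul"),
    ("patrick", "paul"), ("paul", "wolf"), ("paul", "rudolph"),
    ("paul", "bill"), ("bill", "paul"),
}
criminal = {"paul"}


def suspicious(a):
    """suspicious(A) :- attends_golf_club(A), communicate(B,A),
                        communicate(A,B), criminal(B)."""
    return a in attends_golf_club and any(
        (b, a) in communicate and (a, b) in communicate for b in criminal
    )


result = sorted(p for p in persons if suspicious(p))
```

Jim and Mario attend the golf club but have no contact with Paul, while Wolf and Rudolph only received messages from Paul, so none of them satisfies the rule body.
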

Traditional data mining techniques assume that the
data comes from a single table. They have
not been designed for analyses of multirelational
data. On the other hand, relational data mining allows
finding patterns in data from multiple relations that
are richly connected (Mooney et al., 2002). In particular,
these techniques seem to be effectively
employable in the context of the analysis of communication
and internet access data. Internet access data
is quite complex and hard to analyse. Although
a bulk of site accesses exists for each individual,
a high percentage of the data may be irrelevant for
analysis (e.g. sport news, online newspapers).
By enriching site visits with semantic categories,
analyses may focus on relevant parts.

Data Analysis Example: 
Social Network Analysis 


Let us assume that the communication acts of a
set of persons are to be observed. Either there is a
strong indication of social relationships among the
persons or they are assumed to communicate with
each other. In the latter case, a separate analysis
could prove whether there are significant communication
acts within the group. In the former case,
social relationships between the group and other
previously unknown persons could be detected. In
the following, the process of extracting common
contacts of a group of persons is described. We
use the communication network of eight persons
illustrated in Figure 4. Each node (P1, ..., P8) represents
a person, whereas each edge models the
communication acts between two persons. The
weight [w1, w2] of an edge summarises the number
of messages exchanged between the connected
persons. For example, the weights [3, 4] between P2
and P3 indicate that 3 messages were sent from
P2 to P3 and 4 messages were sent from P3 to P2.

While the communication between P2 and 
P3 is bidirectional, unidirectional communica- 
tion may also occur, for instance, from P1 to P2. 
We start by globally selecting the persons to be 
investigated. A group of three persons is defined 
(P2, P4, P6) and all relevant common contacts 
between the group and other persons should be 
detected. Furthermore, a certain period of inter- 
est — e.g., two weeks — is specified. Thus, only 
the message exchange within the last two weeks 
is considered. In the next step, data requests are 
created and propagated to all ISPs’ data sources. 
At each ISP a local data extraction is triggered. 
Hence, for each person of the group, all associated
IP_Address and Mail_Account entries are
identified and the Message table is searched for
related entries.

Furthermore, the result set is filtered according
to the time constraint of the last two weeks. After
the retrieval has been completed, the data is sent
to the global requester, who combines all partial
result sets into a global result set. Duplicate entries
have to be omitted before the analysis can be
started.

In Figure 5 the resulting condensed communication
net is shown. As only the communication
acts of (P2, P4, P6) are relevant, the edges between P1
and P8 and between P3 and P5 are omitted. Further, the
common edges of (P2, P4, P6) are combined, and
weights are accumulated. For instance, P2 ↔
P3, P4 ↔ P3 and P6 ↔ P3 with weights [3,
4], [2, 2], [2, 2] are summed up to one common
edge with weight [7, 8]. If we rank all edges we
find that persons P3 and P5 are more strongly connected
to our group than persons P1 and P7.

Table 1. Message multiplicity

In our example, an edge from a person to our group
means that at least one member of the group has
exchanged a message with that person.
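
The condensation step can be sketched as follows. The three edges to P3 use the weights quoted in the text; all other edges and their weights are invented, because the content of Figure 4 is not reproduced here. The optional `require_all` flag, a name chosen for this sketch, keeps a contact only if every group member has an edge to it.

```python
GROUP = {"P2", "P4", "P6"}

# (group member, contact) -> (messages sent by member, messages sent by contact).
# Only the P3 edges are taken from the text; the rest is hypothetical.
edges = {
    ("P2", "P3"): (3, 4),
    ("P4", "P3"): (2, 2),
    ("P6", "P3"): (2, 2),
    ("P2", "P1"): (0, 1),
    ("P4", "P5"): (2, 1),
    ("P6", "P7"): (1, 0),
}


def condense(edge_weights, group, require_all=False):
    """Merge all edges between group members and outside contacts."""
    totals = {}
    members = {}
    for (member, contact), (sent, received) in edge_weights.items():
        if member not in group or contact in group:
            continue  # only edges from the group to outside persons count
        old_sent, old_received = totals.get(contact, (0, 0))
        totals[contact] = (old_sent + sent, old_received + received)
        members.setdefault(contact, set()).add(member)
    if require_all:
        totals = {c: w for c, w in totals.items() if members[c] == set(group)}
    return totals


condensed = condense(edges, GROUP)
conservative = condense(edges, GROUP, require_all=True)
```

The accumulated weight for P3 reproduces the value [7, 8] given in the text.
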

However, a more conservative approach would
be to keep an edge to a person only if all members
of the group exchanged at least one message with
the person. In our example, all edges except the one
between G1 and P3 would disappear. Furthermore, message
parameters could be included in the selection and
combination process. For instance, if only direct
messages sent from one person to exactly one
other are examined, all messages having
type “TO” and a multiplicity value equal to 1
are selected. If the degree of privacy of a message
has to be considered, messages might be weighted
by multiplying each message occurrence by a factor
of 1/multiplicity, as Table 1 shows. Although
fewer messages were sent from P2 to P3,
the degree of privacy of the exchanged messages
is higher than between P1 and P2.
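
The weighting by 1/multiplicity can be illustrated numerically. The multiplicity values below are invented and are not the values of Table 1; they are merely chosen so that the pair with fewer messages obtains the higher privacy-weighted score.

```python
# Hypothetical multiplicities (number of addressees) of the messages
# exchanged between two pairs of persons.
p1_to_p2 = [5, 5, 10, 10, 20]  # five messages, all sent to many addressees
p2_to_p3 = [1, 1, 2]           # three messages, mostly one-to-one


def privacy_weight(multiplicities):
    """Sum of 1/multiplicity over all message occurrences."""
    return sum(1.0 / m for m in multiplicities)


def direct_messages(records):
    """Select messages of type 'TO' with exactly one addressee."""
    return [r for r in records if r["type"] == "TO" and r["multiplicity"] == 1]


weight_p1_p2 = privacy_weight(p1_to_p2)
weight_p2_p3 = privacy_weight(p2_to_p3)
```

A one-to-one message contributes a full point, whereas a message sent to twenty addressees contributes only 0.05.
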


Unbalanced Data in Context of Data 
Mining 


Existing learning and classification systems may 
be significantly influenced by the frequencies of 
classes (Garcia, 2007). That is, the precision of 
classifications is reduced if one class is heavily 
underrepresented (minority class) compared to 
another (majority class). In the above-mentioned
example, the class of suspicious persons may
contain significantly fewer people than the class of
the non-suspicious persons and thus form a minority
class. The suboptimal classification performance
was recognized in the machine learning / data
mining research community, which strives to develop
new techniques that prevent standard
classifiers from being overwhelmed by majority classes
and ignoring the minority classes (Chawla, 2004).
Some approaches increase the number of samples 
in the minority class by adding artificial copies of 
samples or interpolated samples, which is called 
oversampling (Chawla, 2002). 
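A minimal sketch of both oversampling variants is given below. It is an assumption-laden illustration, not the method of the cited work: the function name and parameters are hypothetical, and the interpolation partner is chosen at random from the minority class, whereas published techniques typically interpolate towards one of the nearest neighbours of a sample.

```python
import random


def oversample(minority, target_size, interpolate=False, seed=0):
    """Grow `minority` (a list of numeric feature tuples) to `target_size` samples."""
    rng = random.Random(seed)
    result = list(minority)
    while len(result) < target_size:
        a = rng.choice(minority)
        if interpolate:
            # New point on the line segment between two existing minority samples.
            b = rng.choice(minority)
            t = rng.random()
            result.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
        else:
            # Plain artificial copy of an existing minority sample.
            result.append(a)
    return result
```

Copying leaves the feature distribution unchanged and only raises the class frequency, while interpolation creates new points inside the region already covered by the minority class.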

Other techniques are based on the concept 
of undersampling and remove random or noisy 
samples of the majority class. We want to point 
out that the retained internet access and commu- 
nication data is strongly unbalanced due to the 
vast amount of noisy data. Therefore, appropri- 
ate data mining techniques have to be deployed 
on the retention data. For instance, for applying 
inductive logic programming on unbalanced 
data, the Gleaner (Goadrich, 2006) algorithm 
was developed. However, a detailed experimental 
evaluation of existing data mining techniques, 
applied to retention data, is beyond the scope of 
this chapter and left for future research. 
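The random undersampling mentioned at the beginning of this paragraph can be sketched as follows. The function name and the `ratio` parameter are assumptions made for illustration; removing noisy rather than random samples would require a domain-specific noise criterion that is not shown here.

```python
import random


def undersample(majority, minority, ratio=1.0, seed=0):
    """Randomly keep at most ratio * len(minority) samples of the majority class."""
    rng = random.Random(seed)
    keep = min(len(majority), int(ratio * len(minority)))
    # random.sample draws without replacement, so no majority sample is duplicated.
    return rng.sample(majority, keep), list(minority)
```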


DATA SECURITY ISSUES 


The amount of data stored in this scenario com- 
bined with the number of involved stakeholders 
makes it important to think about who can access 
the data. There are many different scenarios in 
which somebody might try to gain access to the 
stored data. These can range from competitors 
trying to get information about the customers 
of their competitors, to criminals trying to steal
data in order to assume identities and use names and user
accounts for illegal activities.


Dealing with Distributed Systems 


While providing secure access for certain user
groups and monitoring their activities is generally
a challenging problem, the level of difficulty
increases significantly when dealing with a
distributed system. In our scenario we have multiple
ISPs, and each of them has to collect and store
data for its own customers. Taken by itself, this
problem seems fairly trivial: it just requires a
database and a set of users who have the rights
to access that database. Assigning the required
rights to users and maintaining the database is a
process that is easily manageable. As mentioned
before, however, our scenario includes multiple
databases, each maintained by one ISP. This means
that each of these companies not only has to make
sure that the retained data is stored in its database,
but also has to provide means for the authorities to
access the data and to combine it with the data
provided by other ISPs.


Data Access Control 


One main problem when it comes to data access
control is that there might be a very large number
of people accessing data, and, to complicate things
even further, the data is stored in different databases in
different locations. There are two main approaches
which can be used for data access. Which of the 
two approaches should be used depends mainly 
on how easy the actual data retrieval is supposed 
to be. In a very simple scenario the authorities 
could ask the ISPs to provide certain information, 
and one of the ISP’s employees queries the data 
from the database and transmits that data to the 
authorities. This approach sounds very simple,
but requires a lot of manpower on the ISP’s side.
Furthermore, this approach raises a couple of new
questions, for example what such a request for
data has to look like, or how the data is transmitted
to the requester.

The second approach is much more compli- 
cated when it comes to the initial setup, but greatly 
simplifies queries later on. If an architecture is 
created which allows the authorities to directly 
pull data from the ISPs’ databases (as opposed 
to a push model in the previous scenario), there 
is no active data retrieval needed on the ISP’s 
side. The initial setup requires all ISPs to agree
on an interface, and a common infrastructure
for authentication and transfer security. This is
a challenging problem, as it requires one central
point that is trusted by every involved party, and
which also has to be willing to take on the work of
managing the centralized structure for all parties.
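To make the idea of an agreed interface more concrete, the following sketch shows what a pull model could look like. All names (`AccessRecord`, `RetentionNode`, `InMemoryNode`, `pull`) and the record fields are hypothetical; authentication and transfer security, which the text identifies as the hard part, are deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class AccessRecord:
    """One retained internet access session in the agreed common format (assumed fields)."""
    isp: str
    customer_id: str
    ip_address: str
    start: str  # ISO 8601 timestamp, so string order equals time order
    end: str


class RetentionNode(Protocol):
    """Interface every ISP would have to implement (hypothetical)."""

    def query(self, ip_address: str) -> Iterable[AccessRecord]: ...


class InMemoryNode:
    """Stand-in for one ISP's database."""

    def __init__(self, records: Iterable[AccessRecord]) -> None:
        self._records = list(records)

    def query(self, ip_address: str) -> List[AccessRecord]:
        return [r for r in self._records if r.ip_address == ip_address]


def pull(nodes: Iterable[RetentionNode], ip_address: str) -> List[AccessRecord]:
    """Authority side: query every node and merge the answers chronologically."""
    merged = [r for node in nodes for r in node.query(ip_address)]
    return sorted(merged, key=lambda r: r.start)
```

Because every node returns records in the same format, combining the results is a simple merge; no employee of an ISP has to be involved in answering the query.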


Data Privacy 


The aspect of data privacy can be covered by a
few basic concepts which have to be considered
when designing a system such as the one proposed in
this chapter. The requirement is to guarantee that
the person intended to get some information is
also the only one who gets that information. This
means in the first place that each user needs to be
identified. This can be achieved by using a public
key infrastructure (PKI).

A public key infrastructure is an arrange- 
ment in which a user is uniquely identified by a 
trusted party acting as a certificate authority. This 
certificate authority then provides that user with 
a pair of asymmetric keys allowing that user to 
get encrypted data (which can only be read by the 
person with the private key), and also allowing 
the user to sign documents (again, this can only 
be done by the person with the private key). The
counterpart, the public key, is published by the
certificate authority to provide all other involved
parties with the means to send encrypted information
to that user, and to verify that user’s signature.
This technology provides the means to comply
with the most important requirements: identifying
the person sending a request, and then, when sending
the requested data, making sure that this data
can only be read by the person who originally
requested the data.
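The two operations described above, signing with the private key and encrypting with the public key, can be illustrated with a toy example. The sketch below uses textbook RSA with tiny primes purely to show the principle; it is an assumption chosen for illustration, it is not secure, and a real deployment would rely on an established cryptographic library, proper padding and certificates issued by the certificate authority.

```python
import hashlib

# Toy key pair built from tiny primes. Real keys are thousands of bits long
# and are used with padding schemes; this is for illustration only.
P, Q = 61, 53
N = P * Q                           # public modulus (3233)
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def digest(message: bytes) -> int:
    """Hash the message and reduce it into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Only the holder of the private exponent D can produce this value."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Anyone can check a signature using the published values E and N."""
    return pow(signature, E, N) == digest(message)


def encrypt(value: int) -> int:
    """Anyone can encrypt with the public key (value must be smaller than N)."""
    return pow(value, E, N)


def decrypt(cipher: int) -> int:
    """Only the holder of the private exponent D can recover the value."""
    return pow(cipher, D, N)
```

An authority would sign its request so that the ISP can identify the sender, and the ISP would encrypt the answer with the requester's public key so that only the requester can read it.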

Public key infrastructures are quite common 
today, and there are many initiatives trying to 
provide people, in some cases for example all 
citizens of a country, with such certificates. Ex- 
plaining further details of public key cryptography 
would by far exceed the scope of this chapter, but 
there is a plethora of books and scientific articles 
available on this topic. Right now we only need
to know that there are many technologies available
that can guarantee the security of data being
transmitted to a very large degree.

A second important aspect is to provide means 
to audit the access to data. A system that is secured 
against access from people who are not registered 
as users of a system is not enough. Many attacks
on a system could be carried out by registered users.
For example, people could be bribed by criminal
organizations, or could simply sell the data to whoever
is willing to pay most for it. Providing audit
trails can greatly reduce the danger of such things
happening, and if they still happen despite these
precautions, audit trails at least allow the prosecution
of the persons who abused the system.

Audit trails are in essence nothing more than 
a list of chronological events that happened in the 
system, allowing retracing at a later point which 
user accessed which data at which time. Of course 
this does not solve the old problem of users using 
insecure passwords, leaving workstations logged 
in to secure systems, or other negligent behaviour 
of users. 
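A minimal audit trail along these lines might look as follows. The class and field names are assumptions made for this sketch; an in-memory list is of course not tamper-proof, so a production system would additionally need write-once storage or a comparable protection of the log itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AuditEvent:
    timestamp: datetime
    user: str
    record_id: str
    action: str


class AuditTrail:
    """Append-only list of access events, stored in the order they are recorded."""

    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def record(self, user: str, record_id: str, action: str = "read",
               when: Optional[datetime] = None) -> AuditEvent:
        event = AuditEvent(when or datetime.now(timezone.utc), user, record_id, action)
        self._events.append(event)
        return event

    def accesses_to(self, record_id: str) -> List[AuditEvent]:
        """Which users accessed this piece of data, and when?"""
        return [e for e in self._events if e.record_id == record_id]

    def accesses_by(self, user: str) -> List[AuditEvent]:
        """Which data did this user access, and when?"""
        return [e for e in self._events if e.user == user]
```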


COST AND PERFORMANCE 
ASPECTS 


Two factors that are important for a project such as
the one described here are the costs generated by 
the storage of the data, and also the speed at which 
requests can be answered. Both of these factors 
have a strong influence on the general acceptance 
of a system, and the ease of implementation. 


Cost Aspects 


Costs can be separated into two different groups. 
First there are the initial costs to buy all the required 
hardware and create the system, and then there are
the annual costs for keeping the system running
and answering all the queries. The first year of service
has the highest total costs, as it requires regular
operation, just like all following years, but
also requires initial investments. Besides the purchase
of required hardware and software licenses,
staff expenses are also significantly higher during
that period. Even though the amount of data
that has to be stored for each customer per year 
appears to be relatively small, the requirements 
of easy access and a good secure infrastructure 
can generate a relatively high cost per customer. 


Expected Query Response Time 


A certain level of service has to be guaranteed for 
all requests. This is especially important in this 
distributed environment, as many scenarios will 
require several queries to many different ISPs. To 
make sure that each request gets a matching reply
in a timely manner, clear interfaces and processes
have to be defined. Additional time might be 
required to combine the data, if there is no clear 
interface making sure that the data provided by 
each of the ISPs matches the format of the others. 

In an ideal implementation where a clean ar- 
chitecture is provided that does not require any 
human interaction, such a query, including the 
combination of the results provided by the different 
ISPs, could get a reply in almost real-time. In an
architecture that requires a lot of human interaction,
the worst case would be that a
request is sent to each ISP, each of them has
to verify the request and then prepare the data (each
according to its own data structure), and that
data has to be combined later on, a workflow
that might take a very long time.
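The difference between the two extremes can be illustrated with a small simulation of the automated case. In the sketch below the ISPs are queried concurrently, so the overall response time is bounded by the slowest ISP or by a deadline, rather than by the sum of all individual response times. The function names, the simulated delays and the deadline are assumptions made for illustration only.

```python
import asyncio


async def query_isp(name, delay, records):
    """Stand-in for a network call to one ISP; `delay` simulates its response time."""
    await asyncio.sleep(delay)
    return name, records


async def query_all(isps, timeout):
    """Query all ISPs concurrently; ISPs that miss the deadline are reported separately."""
    tasks = {
        name: asyncio.create_task(query_isp(name, delay, records))
        for name, (delay, records) in isps.items()
    }
    done, pending = await asyncio.wait(list(tasks.values()), timeout=timeout)
    for task in pending:
        task.cancel()
    # Let the cancelled tasks finish cleanly before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    answered = dict(task.result() for task in done)
    missing = sorted(name for name, task in tasks.items() if task in pending)
    return answered, missing
```

A call such as `asyncio.run(query_all({"ISP-A": (0.01, ["r1"]), "ISP-B": (5.0, ["r2"])}, timeout=0.5))` returns the answer of the fast ISP and lists the slow one as missing, which corresponds to a guaranteed level of service per request.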


LIMITATIONS IN CONTEXT 
OF DATA RETENTION 


There are several circumstances under which an 
Internet service provider does not possess any 
personal data about the customer it is serving. In
these cases, what is available is information about
the equipment used to connect to the provider,
which depends on the type of Internet connection
employed. When a customer uses the free wireless
Internet access at a cafe or restaurant, for example,
the ISP operating this hotspot has no personal data
available on which person is connecting. What it
does know is the media access control (MAC) address
of the equipment being used to connect to the hotspot.
A MAC address is intended to uniquely identify one
specific piece of networking hardware. Each equipment
vendor has its own address range, which can be looked
up in certain directories. Although the address is
intended to be unique, it is forgeable using readily
available software from the Internet, which renders
the information available to the provider in this
case unreliable.
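The structure of a MAC address makes the vendor lookup straightforward: the first three octets form the vendor prefix. The sketch below extracts this prefix and also checks the "locally administered" bit of the first octet, which is set for addresses that were not assigned by a vendor. The directory entry is a placeholder, not a real assignment, and the function names are assumptions made for illustration.

```python
# Placeholder directory: the vendor name is made up, real prefixes are
# published in the registries mentioned in the text.
VENDOR_DIRECTORY = {"00:11:22": "Vendor A (placeholder)"}


def parse_mac(mac: str) -> bytes:
    """Parse 'aa:bb:cc:dd:ee:ff' or 'aa-bb-cc-dd-ee-ff' into six bytes."""
    parts = mac.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError("a MAC address consists of six octets")
    return bytes(int(p, 16) for p in parts)


def vendor_prefix(mac: str) -> str:
    """The first three octets identify the address range of the vendor."""
    return ":".join(f"{b:02X}" for b in parse_mac(mac)[:3])


def is_locally_administered(mac: str) -> bool:
    """True if the address was set by software or an administrator, not by a vendor."""
    return bool(parse_mac(mac)[0] & 0x02)


def lookup_vendor(mac: str) -> str:
    return VENDOR_DIRECTORY.get(vendor_prefix(mac), "unknown")
```

A set locally-administered bit shows that the address does not come from a vendor range, but the reverse does not hold: a forged address can copy a legitimate vendor prefix, so the lookup cannot establish that the address is genuine.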

Another possibility of anonymous Internet access
is provided by certain dialup providers. Usually
a provider asks for a user ID and a password
during the authentication process and is able to
connect this user ID to an entry in its customer
database, therefore knowing who it is serving.

Because of the different technical possibilities
for two parties to communicate via e-mail, the
data available for storage differs depending on
the mail server and protocols used and on the
location of the participating parties. Following
is a description of the possible setups and their
consequences for data collection. An Internet
service provider has two possible sites at which
it can monitor e-mail traffic within its network
and in exchange with other networks: its own mail
servers and its interfaces to other networks. For
e-mail messages transmitted and received at the
provider’s own mail servers, the data can be
collected at these points. If its customers are
using mail servers outside of its network, the data
is transmitted over the provider’s interfaces to
other networks, which is where the provider is able
to observe this traffic. All the protocols used in
the following scenarios may be encrypted or not,
which in some cases makes a difference.

Scenario 1: ISP Mail Server only. Scenario one
is the situation in which two users exchange e-mail
messages using the ISP’s mail server, independent
of the protocols used for mail transmission and
receipt. The mail server is controlled by the ISP
and the parties are therefore customers of the ISP.
This scenario does not specify how the customer is
connected to the Internet. The usual case, as the
customer is using the ISP’s mail server, would be to
gain Internet access through this ISP, but the
customer may also be using the mail server
from another network. If the communicating users
(sender or receiver) are accessing the ISP’s
mail server from inside the network, the IP addresses of
the corresponding internet accesses are known
to the provider and may be immediately related
to persons. If another national ISP provides the
internet access, the identities may be revealed
by querying other data retention nodes in the
distributed architecture. If the internet access is
provided by a foreign provider, no information
about the persons providing the internet access
may be stored. However, the identities of the
communicating users are known, unless the mail
accounts are misused by a third party.

Scenario 2: ISP/Non-ISP Mail Server Mix. In
scenario two, again the provider’s mail server is
used, but this time one party is outside the ISP’s
network or its control. The two mail
servers are interconnected by SMTP, with the
non-ISP mail server communicating with another
mail server or directly with the second party. The
customer may be accessing the
mail server either from inside the ISP’s network
or from the outside. Similar to scenario one, the
persons providing the internet access for the
communicating users may be known or not. Further, the
personal data of only one communicating party is
known. The personal data of the other party may
be available at another national ISP, or unknown in
the case of a foreign communication party.

Scenario 3: Non-ISP Mail Server. As is the
case for people using free mail providers to send
their mails, in scenario three an outside mail server,
out of the control of the ISP, is being used as a
mailbox and transmitting server by a provider’s
customer. This scenario only applies to
communications via POP / IMAP and SMTP, therefore
excluding web mail access. In contrast to the first
two scenarios, this scenario does fix the location
of the customer. Because the mail server used is
not under the control of the ISP, the customer has
to be inside of its network for his e-mail traffic to
be visible on the network boundary. This scenario
is of particular interest because it represents the
only circumstances relevant for the EU directive
where encryption, which is possibly applied to
the communication, does make a difference: this
case is invisible to the provider if the POP, IMAP
or SMTP sessions are encrypted.

Scenario 4: Non-ISP Web mail. Scenario four applies
when using a web mail interface of a mail provider
that is not the user’s ISP. The user
gains access to his mailbox, residing on a free mail
provider’s mail server like mail.google.com for
example, by using HTTP exclusively. By ‘web
mail server’ the web server hosting the interface
for accessing the user’s e-mail messages is meant.
This web mail server is in some, possibly proprietary,
way connected to, or is itself, the mail server
which communicates either directly with the second
party or via additional mail servers. This scenario
is invisible to the ISP because the communication
happens via HTTP. The data is, like in scenario
three, being transferred over the provider’s lines,
though not as SMTP but as HTTP packets. The e-mail
data is therefore embedded in web traffic and may,
as it is content of a communication, not be stored
according to the EU directive.
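The four scenarios can be condensed into a small decision function that states where, if anywhere, an ISP can collect e-mail traffic data. This is a simplified reading of the scenarios above; the enumeration, the function name and its parameters are assumptions made for illustration.

```python
from enum import Enum


class Visibility(Enum):
    MAIL_SERVER = "collectable at the ISP's own mail server"
    NETWORK_BOUNDARY = "observable at the ISP's interfaces to other networks"
    NOT_VISIBLE = "not visible to the ISP"


def traffic_data_visibility(isp_mail_server: bool, protocol: str, encrypted: bool,
                            customer_inside_network: bool = True) -> Visibility:
    """Map a communication setup to the point where traffic data can be collected."""
    if isp_mail_server:
        # Scenarios 1 and 2: the ISP controls the mail server, so the protocol
        # and its encryption do not matter.
        return Visibility.MAIL_SERVER
    if protocol.upper() in {"POP", "IMAP", "SMTP"}:
        # Scenario 3: only unencrypted sessions of customers inside the ISP's
        # network are visible on the network boundary.
        if customer_inside_network and not encrypted:
            return Visibility.NETWORK_BOUNDARY
        return Visibility.NOT_VISIBLE
    # Scenario 4: web mail is carried over HTTP and is treated as content.
    return Visibility.NOT_VISIBLE
```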


CONCLUSION 


In March 2006 the European Parliament pub- 
lished the EU Data Retention directive 2006/24/ 
EC, requiring operators of publicly accessible 
electronic communication networks to store data 
regarding certain activity on their network to help
fight crime. The directive requires the storage
of data describing certain activities, such as who
used internet access at which times, or who sent
or received e-mail at certain times, but it explicitly
does not allow the storage of any contents
of the electronic communication. This directive
poses several challenges for the authorities and the
affected ISPs alike. A whole new infrastructure
has to be created, allowing the easy and efficient 
retrieval and usage of data that was previously 
only stored in simple log files, or in many cases 
not stored at all. 

In this chapter we have analyzed the require- 
ments for a data warehouse that meets the direc- 
tive’s requirements, focusing on two main aspects, 
the storage of e-mail-related data, and the storage 
of internet access information. After describing a
couple of possible approaches we chose the one that we
found to be the best. Besides providing an example
of what the implementation of a database structure
might look like, we also showed how a distributed
architecture can be used in a scenario like the data
retention guideline. In the data analysis section we
provided examples of how different data mining
strategies can be used within the scenario we are
faced with in this chapter. One of the examples shows
how inductive logic programming can be used
to derive information from known facts. Another
example shows how social network analysis can
be performed on the data.

Finally we also provided an overview of many
factors surrounding the scenario of retaining large
amounts of data and later on providing access to
the data. To be more precise, we described aspects
related to data and access security, privacy of
sensitive data, as well as cost and performance
aspects connected with running such a system.
We also provided a short list of examples showing
the limitations of data retention in the context of
internet related technologies.

Data Mining Challenges in the Context of Data Retention


KEY TERMS AND DEFINITIONS 


EU Data Retention Directive 2006/24/EC:
Requires the operators of publicly accessible
electronic communication networks to store (“retain”)
certain data which is generated or processed
in their networks to serve the investigation, detection,
and prosecution of serious crime.

Data Warehouse (DWH): A specialized 
data storage technique that facilitates analytic 
processing of data. That is, data is analyzed 
interactively based on hypotheses. DWH data is 
stored in proprietary schemas that are optimized 
for data analysis. 

Data Mining: Is a type of data analysis. The 
result of a data mining process is a set of rules, or 
patterns, or associations revealing knowledge that 
was previously unknown. While statistical analysis 
is used to verify or reject predefined hypotheses, 
data mining algorithms are deployed to generate 
hypotheses from available data. 

Social Network Analysis: Is a technique
that strives to reveal social relationships (e.g.
communication acts) between individuals that
have not been known before. It may be applied
to communication data such as email traffic data.

Inductive Logic Programming (ILP): Is a
pattern-based data analysis technique that may
be applied to email and internet access data. The
main idea is to create a knowledge model of a
certain domain and to use inductive inference
to create hypotheses based on this model.

Internet Service Provider (ISP): Is an or- 
ganization offering internet access to customers. 
The internet access is provided by servers of the 
company. Additionally, an ISP may offer email 
accounts for customers. 

Data Privacy: Covers aspects of data protec- 
tion of person-related information that is collected 
and stored in information systems of companies 
and/or authorities. The collected data must be 
protected from disclosure. That is, authentication 
and access control policies are used to prevent 
unauthorized access to person-related data.


ABSTRACT


Understanding data mining (DM) as part of Information Systems (IS), this contribution investigates
how this subordination is justified from a technological and a business-logic perspective. For this
purpose, general characteristics of Enterprise Resources Planning Applications (ERP) and Management
Information Systems (MIS; including here Decision Support and Expert Systems) are presented. Based
on this evaluation it is examined how knowledge and DM are becoming interdependent for Knowledge
Management (KM) in organizations. Knowledge is defined along the Penrose’an dichotomy of information
and knowledge in the context of resources and services. Validity of knowledge is analyzed from a
methodological (quantitative versus qualitative methods) perspective, probing what the key characteristics
of both method strands are, and how those fit into the discipline of Organizational Studies. Unveiling a
relationship between security and information in Penrose, an alternative account of security originating
in Foucault is presented. In this account, security and knowledge become means for the standardization of life in
order to allow for the continuation of an abstracted, socially generated object. Combining arguments about
the validity of knowledge claims with those about security, DM based knowledge and security are identified as
means of abstracting from a human core and attempting to constrain variability. Against this background
researchers and users of DM based knowledge are asked to be aware of the constructed character of
IS, and of how much of this constructed character is contained in DM based knowledge.


INTRODUCTION


The purpose of this paper is to examine what grants
data mining (DM), and more generally business
intelligence (BI), its special status in the dedicated
organisation and in organisational sciences. On
a conceptual level DM is understood as part of
Information Systems (IS) research. Furthermore,
this paper is interested in developing
an account of DM and its knowledge creation
capabilities. The scope of this paper is limited to
the intra-organisational and academic development
and spread of knowledge to create better “action
options”. It does not consider actual methods for
data mining, nor does it report on DM
procedures for the generation of data pools originating
in different organisations. The author examines the
processes of ‘knowledge’ generation and validity
assignment; the paper is of a theoretical nature and
rests on interpretative methods.
Two research questions are examined:


How does data mining fit into organizations’ in- 
formation system landscape for information (and 
knowledge) collection and spread? 


How is validity attributed to knowledge that is
generated by DM, while other methodological
approaches are neglected in organisations and
organisational sciences?


Based on these research questions, an introductory
chapter gives an overview of information
systems (IS) and of how DM is related to them.
Following this introduction, knowledge/Knowledge
Management and methods for knowledge
generation are examined based on a Penrose’an
understanding of knowledge. Penrose is chosen
here, as she is widely perceived as one “founder”
of knowledge management. Taking up the question
why DM based knowledge is valued more than
knowledge originating from qualitative methods,
characteristics of knowledge formation in both
methodological stances are presented. It is examined how
DM relates to criticism brought forward against
knowledge generated outside of academia and research
(e.g. Ravetz 1996; Thompson-Klein 1996;
Nowotny et al., 2004). Arguments developed there
form the background for the identification of Future
Trends in knowledge generation via DM. The
chapter concludes by consolidating the technical
background of DM under the heading of knowledge
management and the alleged methodological
superiority of DM. This consolidation is
based on Foucault’s analysis of security and how
IS relate to this understanding.

This paper relates itself to the fields of DM
and Information Systems, suggesting that DM is both an
output of and a contribution to Information Systems (IS).
The author takes up the offer to formulate perspectives
on the subject of DM and value creation with
knowledge by examining how DM initially was
part of the discourse on Knowledge Management
with a strong technological emphasis (IT technologies).
Questioning the validity of knowledge
generated with DM, the relation between DM and
decision making in organizations is analysed. This
contribution explores implications of DM and its
social (and network) impact. In the author’s account,
the paper relates itself to the handbook by
highlighting features of DM in organisations and
the discipline of Organisational Sciences from
the perspective of knowledge creation and the validity
of DM based knowledge.

Societal effects of technological advancement
as an enabler for DM, and their impact on globalization
(e.g. Castells, 1996), are taken for granted.
The reason is that the author is not interested in
technologies and their contribution to globalisation,
but in the process by which validity is
attached to the outcomes of DM (cp. Floyd,
1992 a/b; D’Adderio, 2002; Kallinikos, 2004).

This contribution does not consider data quality,
while acknowledging data quality’s eminently
important role in the process of DM. This omission
is justified by noting that the author explicitly refers
in the course of the paper to notation differences on
what is important for whom when the generation
of data-marts is described (Markus, 2001).


BACKGROUND: DATA MINING 
AND INFORMATION SYSTEMS 


In this part, terminological clarifications are given 
and the theoretical questions inspiring this article 
are unveiled. 


Information Systems 


The definition of IS used rests on the distinction 
between technical- and human-systems, and their 
interaction in socio-technical systems. IS are “[..] 
system[s] of communication between people. 
Information systems are systems involved in the 
gathering, processing, distribution and use of 
information. Information systems support human 
activity systems” (Beynon-Davies, 2002, p.4). 
Human activity systems are any kind of human 
interaction that is happening for a given purpose; 
public and private organisations represent human 
activity systems. Based on this perspective IS are
divided into infrastructure and instruments (applications)
built onto this infrastructure (Beynon-Davies,
2002, p. 66). Applying this definition,
Management Information Systems (MIS) and Enterprise
Resources Planning Applications (ERP),
and applications utilised for the administration of
databases themselves are of interest.

MIS are information-oriented applications. 
Their purpose is structuring and controlling differ- 
ent processes in organisations (Turban, McLean, 
and Wetherbe, 2002, p. G-7). MIS are employed 
for standard operations and decision-making in 
organisations (ibid., p. 52). Beynon-Davies (2002)
stipulates a close link between data used for MIS
and Expert Systems (ES). The latter are employed for
non-routine activities with high levels of insecurity,
as less knowledge exists about the impacts
and potential outcomes of actions (Beynon-Davies,
2002, p. 92). ES (cp. Stahlknecht & Hasenkamp,
2002) attempt to structure information available
to managers. ES are based on a multi-layered
technical architecture using ontologies, represented
by semantic means (natural language), of
a certain field of examination (Stahlknecht and
Hasenkamp, 2002, pp. 435-439). ES automate
the process of decision-making in order to allow
management to refocus on managerial activities.
During the late nineties of the last century, these
expert systems became subject to strict scrutiny.
With the social movement of knowledge management,
it became apparent that ES are limited in
their capability to describe in unitary ways the
universes of examination among experts from
different domains (e.g. Pipek, Hinrichs & Wulf,
2003; cp. Markus, 2001).

Enterprise Resources Planning Applications
(ERP) are similar to MIS. They emerged during
the 1980s and extended the view of MIS
by combining data collected for each domain
(functional area of an organisation) into overall
organisational data. By allowing for and facilitating
the generation of ever more abstract data, they ideally
enable senior management to anticipate, based on data,
the direction in which organisations are heading. During
the nineties of the last century ERP were enriched
with workflow (process) steering elements (e.g.
Kallinikos, 2004).

A dedicated class of applications directly associated
with the creation of data repositories
is databases. These exist in different varieties.
According to Beynon-Davies (2002, p. 139; cp.
Stahlknecht and Hasenkamp, 2002) a database is
a logical description of a given object in line with
the attributes that describe the object in question
(Stahlknecht and Hasenkamp, 2002, pp. 139-140).
A database is an “[..] organised repository for data
having similar properties” (Beynon-Davies, 2002,
p. 139). Given this functional task description,
high technological efforts are required to maintain
the functions of data a) sharing, b) integration, c)
integrity maintenance, d) security, e) abstraction,
f) independence (ibid., p. 140). For data mining in
particular the properties of data sharing, integration,
abstraction and independence are important;
they allow for new knowledge generation (cp.
Blackwell and Ravda, w/o year, chapter 2).

Database management systems (DBMS) 
facilitate these functions, and ensure structural 
maintenance, transaction processing, informa- 
tion retrieval, and finally data administration of 
databases. DBMS are the interface between the
actual data storage and the data model with the
normalised data description (Beynon-Davies,
2002, pp. 140-141).


DATA GENERATION 
FOR DATABASES 


Data normalisation is the process of stripping a
given focal object of all its features and thereby
defining the properties (attributes and tables)
describing and defining the object’s characteristics
(Stahlknecht and Hasenkamp, 2002, pp. 177-179;
Beaulieu, 2005, pp. 4-6). Information systems and
databases involve descriptions of real-life objects.
In addition, the question arises of how developers and
users understand them.

In the relevant IS literature, there exists criticism
of this process of abstraction. Authors are
disturbed by the way in which scientific computing
is developed and applied without questioning
the conceptual models underlying software design
(e.g. Floyd, 1992 b; Capurro, 2008; Stahl and
Brooke 2008). Some of this criticism is directed at
the form of software design (Floyd, 1992 a), while
others refer to the involvement and apprehension of
uninvolved users in system design (Markus,
1983; Curtis, Krasner and Iscoe, 1988; more recent
Capurro, 2008; Stahl and Brooke, 2008). Another
stream of criticism relates to the reality shaping
forces inherent to IS (e.g. Orlikowski and Gash,
1994). IS are not only technological objects representing
a translation of social reality into code,
but become part of a reality construction: IS and
social reality are interdependent objects. This
interdependency is validated and controlled for
side effects (cp. Floyd, 1992 a; Orlikowski and
Gash, 1994; Adler and Borys, 1996). Kallinikos
(2004) and Isomäki (2002) show how much IS
are based on rationalistic accounts of humans by
software designers.


Data Mining 


Data mining has many different definitions, of
which two are given here. First it is “[..] described
as ‘the nontrivial extraction of implicit, previously
unknown, and potentially useful information from
data’ [Witten and Frank, 2005, XXIII] and ‘the
science of extracting useful information from large
data sets or databases.’ Data mining in relation to
enterprise resource planning is the statistical and
logical analysis of large sets of transaction data,
looking for patterns that can aid decision making.”

The first definition of DM guides this paper.
Such extraction is achieved by applications that


[..] sift through databases automatically, seeking 
regularities or patterns. Strong patterns, if found, 
will likely generalize to make accurate predictions 
on future data (Witten & Frank 2005). 


DM focuses on uncovering patterns in information
contained in databases, or data warehouses, that
are not immediately open to human perception; it
helps humans generate new knowledge from
information in order to take decisions. DM is a
method for information conversion in a triadic
relationship of “data-information-knowledge”
(Luft, 1994; Tuomi, 1999; Lee and Yang, 2000;
Alavi and Leidner, 2001). For Witten and Frank
(2005) DM is the process of “abstraction: taking
the data, warts and all, and inferring whatever
structure underlies it” (Witten and Frank, 2005,
XXIII). Given this scope, it is not surprising that
they focus on the process


[..] in which the result of “learning” is an actual
description of a structure that can be used to
classify examples. This structural description
supports explanation, understanding and prediction
(Witten & Frank 2005).
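
What such a “structural description” can look
like may be illustrated with a minimal sketch. It
uses a one-rule (1R-style) learner, chosen here
only because it is short; it is not presented as
the method of the quoted authors for this chapter.
The records, attribute names and class labels are
hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (attributes, class label).
records = [
    ({"segment": "retail", "region": "north"}, "late"),
    ({"segment": "retail", "region": "south"}, "late"),
    ({"segment": "retail", "region": "north"}, "on_time"),
    ({"segment": "wholesale", "region": "north"}, "on_time"),
    ({"segment": "wholesale", "region": "south"}, "on_time"),
    ({"segment": "wholesale", "region": "south"}, "on_time"),
]


def one_r(data):
    """Pick the single attribute whose 'value -> majority class' rule
    misclassifies the fewest records; return (attribute, rule, errors)."""
    best = None
    for attr in sorted(data[0][0]):
        # Count class labels per attribute value.
        counts = defaultdict(Counter)
        for features, label in data:
            counts[features[attr]][label] += 1
        # The rule maps each value to its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(label != rule[features[attr]] for features, label in data)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


attribute, rule, errors = one_r(records)
predicted = rule["wholesale"]  # classify a new wholesale order
```

The learned rule is an explicit description
that a human can read, and it can be applied to
new examples. Whether it generalises depends
entirely on how the underlying records were
generated.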


Interestingly, this comes very close to a formal
description of the term theory as it is maintained
in many conceptions of scientific knowledge (e.g.
Stacey, 2001). With this focus on information
generation from existing data, the author takes
databases as representations of normalised objects
contained in reality, allowing for their
machine-based manipulation. Against this background,
it needs to be clarified how data used in DM are
generated for automated data analysis. Keeping
the abstraction process (normalisation) in mind,
it is necessary to outline the characteristics of
data warehouses in the field of data mining.


Data Warehouses and Data Marts 


“A data warehouse is a subject oriented, integrated,
non-volatile, and time-variant collection of data
in support of management’s decision-making
process” (Inmon, 2002). A vendor defines data
warehouses as


[..] a relational database that is designed for 
query and analysis rather than for transaction 
processing. It usually contains historical data 
derived from transaction data, but it can include 
data from other sources. It separates analysis 
workload from transaction workload and enables 
an organization to consolidate data from several 
sources. (Oracle 2008, 321). 


Quite often data warehouses are either filled
with transactional data (originating in ERP and MIS
systems; online transaction processing, OLTP)
or fed by inputs for Decision Support Systems
(DSS; for a more general description of data
warehouses and their interactions with DM, see
Beynon-Davies, 2002, pp. 456-457). The content of
data warehouses originates in OLTP. The data are
more history oriented, which makes it common to
refer to the content of data warehouses as a
predecessor of online analytical processing (OLAP).
OLTP data have to be condensed and normalised in
order to be usable for OLAP.

OLAP depends on results from the different
departmental OLTP systems, and on the conversion
and normalisation of these data for the data
warehouse. If this does not happen, data
inconsistencies occur that lead to skewed and wrong
results (Chaudhuri & Dayal, 1997, chapter 2; see
also Markus, 2001 op cit).
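
How missing conversion skews results can be
sketched as follows. The example assumes two
hypothetical departmental OLTP extracts that
encode the same product and the same amounts
differently; the field names and the conversion
rules are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical OLTP extracts: two departments describe the same
# product with different codes and different currency units.
sales_oltp = [
    {"product": "BOLT-10", "amount_eur": 120.0},
    {"product": "bolt-10", "amount_eur": 80.0},
]
service_oltp = [
    {"sku": " Bolt-10 ", "amount_cents": 5000},
]


def to_warehouse_row(record):
    """Convert a departmental record to one shared warehouse layout."""
    product = (record.get("product") or record.get("sku")).strip().upper()
    if "amount_eur" in record:
        amount = record["amount_eur"]
    else:
        amount = record["amount_cents"] / 100  # cents to euros
    return {"product": product, "amount_eur": amount}


def totals(rows, key):
    """Aggregate amounts per product, as an OLAP-style summary would."""
    out = defaultdict(float)
    for row in rows:
        out[row[key]] += row["amount_eur"]
    return dict(out)


# With conversion and normalisation: one product, one total.
warehouse = [to_warehouse_row(r) for r in sales_oltp + service_oltp]
clean_totals = totals(warehouse, "product")

# Without it: the same product appears as two different objects.
raw_totals = totals(sales_oltp, "product")
```

Note that the conversion rules themselves
(which spelling counts as the product, which unit
counts as the amount) are decisions taken by
someone, which is where the organisational
conflicts discussed next enter.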

Issues of data quality are of eminent importance,
in particular during data cleansing for importing
data into warehouses. This is also the stage where
organisational departments begin battling for the
power to describe the organisation’s internal
reality and the environment in which it acts
(e.g. D’Adderio, 2002).

In the field of Business Intelligence, OLTP data
are referred to as data marts. The latter are
extracts from the overall data warehouse. Data
marts are repositories of a given domain, and data
utilisation can be restricted within an
organisation, but the data are subject to the same
data purification methods as those for data
warehousing (Chaudhuri & Dayal 1997). Figure 1
shows schematically the process of data extraction,
normalisation, and upload to the warehouse.


KNOWLEDGE AND VALIDITY: A 
METHODOLOGICAL ORIENTATION 
FOR DM 


In this section, the task is to decipher what
knowledge is in the field of KM and what the
relation between data mining and knowledge is.
To do so, the author sketches some of the
arguments put forward by Penrose. These arguments
form the basis for the strategic management idea
of Knowledge Management (Grant, 1996 a / b).
The publications of Nonaka and Takeuchi (1995)
and Leonard-Barton (1995) can be interpreted as
the final stages in the emergence of Knowledge
Management as a dedicated field of research in
Business Administration and / or Information
Systems research. The task is to check whether
Penrose lays the groundwork for Knowledge
Management, as is regularly suggested.

Figure 1. Architecture of a data warehouse (Oracle, 2008, p. 324)


Penrose and the Knowledge Term 


Penrose’s understanding of knowledge cannot be
isolated from the organisational form in which
knowledge is dealt with. For her, firms are the
“[..] ‘flesh and blood’ organizations that
businessmen call firms” (Penrose 1995, 13).

Firms execute economic activities in accordance
with plans defining the utilisation of resources
as contributions to the provisioning of services
and goods to the economy in general. Behaviour of,
and within, the firm is determined by these plans,
which are devised by an independent unit within
the firm that is part of the overall bureaucracy of
the corporation. This bureaucracy entails policies,
decision rights and the distribution of work (ibid.,
pp. 15-17). Furthermore, the firm is described
by the existence of “[..] productive resources”
(Penrose 1995, 24) and their utilisation in line
with plans. Resources are capable of rendering
services to the production process. For the
analysis here, the pair of terms services and
resources is enlightening, as the character of
‘knowledge’ is subsumed under this dichotomy.
Penrose (1995) argues that


[..] resources consist of a bundle of potential
services and can [..] be defined independently of
their use, while services cannot be so defined, the
very word ‘service’ implying a function, an activity.


Services inherent in a resource are revealed
while utilising the latter. Resource utilisation
defines the value to the firm, which relates to
other characteristics of resources. Resources can
be bought in the market, produced in the firm, sold
to the market, or produced and used by the firm
(Penrose 1995, 24-25).

She does not give a terminological definition
of knowledge, but there is an alleged affinity to
constructionist understandings of knowledge. This
definitional gap arises from the fact that Penrose
is not concerned with the concept of knowledge
per se. Rather, she attempts to understand how
the application of resources and the services
rendered by them depend on each other.

In this analysis, information gathering relates
to uncertainty minimisation, while knowledge is
contextualised with resources and services and
their development and utilisation. She
differentiates two kinds of knowledge. One kind is
acquired during formal education (“objective
knowledge”), while a second results from individual
learning processes (“experiences”).

Objective knowledge is characterised by a level
of specificity that is not too narrow and is shared
by a sufficiently large group of people (Penrose
1995, 53). While people can have different degrees
of apprehension of this objective knowledge, such
differences result from communication processes
(cp. Hinds & Pfeffer, 2003); they do not lead to
distinct sets of knowledge. Knowledge acquired
by the individual as a result of learning processes
is experience. Experience and knowledge are
different for two reasons. First, experience is
bound to individuals: “[..] it produces a change -
frequently a subtle change - in individuals and
[second, it] cannot be separated from them”
(Penrose, 1995, p. 53). Second, experiences are
results of actions taken. Due to this
action-orientation, experiences lead to an enhanced
understanding of where and how objective knowledge
is employable. Consequently, Penrose speaks of
experiences of individuals if they have acquired
new knowledge and employ objective knowledge in
different ways (ibid.).

Subjective uncertainty is defined by Penrose
as a “[..] feeling that one has too little
information [that] leads to a lack of confidence in
the soundness of the judgments that lie behind any
given plan of action” (Penrose 1995, 59). In this
understanding, collected information is a means of
obtaining a better view of the potential sequences
of activities taken during plan implementation.
The acquisition of information


[..] requires an input of resources, and to
evaluate information requires the services of
existing management. Therefore one of the
important effects of subjective uncertainty is to
induce a firm to devote resources to what might be
termed ‘managerial research’ (ibid).


The amount of information gathered will vary
on an individual firm basis. Even after a
comprehensive search, firms are taking risks, but
on a much more informed basis, and are thereby less
uncertain. Over time firms become used to this
type of information gathering. In turn, specified
amounts and types of information are selected
by agreed-upon “[..] defined procedures” (ibid.,
p. 60). These procedures satisfy the managerial
group’s need for information gathering; they
become part of decision-making. Fulfilling the
procedure of information collection thereby
ensures the validity of the information generated.

Looking at data mining from this angle, it
becomes apparent that information aggregation
in data warehouses and data normalisation are
seemingly predecessor activities to information
processing via OLAP, and thus to knowledge
generation. DM thus has a relationship to
knowledge whereby the latter becomes operationally
more ‘secure’, and plan implementation more risk
resilient.


According to Castells (2002), we are living in a
global and networked world that increases the
amount of insecurity (cp. Dornier et al. 1998 for
the field of operational and logistic planning).
Under these conditions, DM becomes a “natural
process” for the evaluation of potential negative
impacts on the realisation of organisations’ action
plans. Paradoxically, DM, and the information
technologies used by it, contributes to the process
of globalisation (cp. Castells; Dillon 2008 for the
field of globalised securities markets). The
question remains why DM in particular is so
favourably evaluated for its knowledge generating
capabilities, while other means are omitted from
the evaluation of valuable inputs to decision
making in the individual organisation.


First Generation Knowledge 
Management 


In this section the main threads of the first
phase of knowledge management endeavours are
examined and contrasted with the understanding of
knowledge and information coming from Penrose. For
Penrose, knowledge is in the first instance
available, and used, on an individual level.
Knowledge is conceived as a property of the
individual, who belongs to a given group to which a
given set of ‘objective knowledge’ is available;
knowledge is not the central object of the
analysis. That knowledge is a device for the
realisation of other things needs to be emphasised,
as this clearly sets her apart from later
proponents of KM, who treat knowledge as an asset
(cp. Leonard-Barton, 1995; Nonaka & Takeuchi, 1995;
Grant, 1996 a/b). Penrose is concerned with how
knowledge is employed to generate further services.

Knowledge management, as elaborated by
Nonaka and Takeuchi (1995) and Leonard-Barton
(1995), rests on the assumption that knowledge
brought to the organisational level is immediately
accessible for taking action. Often technical
means are perceived as sufficient to provide
options for innovation. Pfeffer and Sutton (2000)
take issue with the consequence of this attitude:
corporations often use Knowledge Management
in the explicit/tacit understanding (Nonaka and
Takeuchi, 1995), leading


[...] to build stock[s] of knowledge [and infor- 
mation], acquiring or developing intellectual 
property (note the use of the term property) under 
the presumption that knowledge, once possessed, 
will be used appropriately and efficiently (Pfeffer 
& Sutton 2000, 16). 


Wilson (2002) argues that during the initial
phases KM was examined in the fields of computing
and information systems, information science and
information management, and management. Most
papers were published in journals about Decision
Support Systems. Often knowledge and information
were used synonymously. Knowledge was
conceived as a deeply individual, personalised
item, referring either to action, in the meaning
of executing a task, or to the human capacity
to draw conclusions from information via
contextualisation, which refers to the knowledge
triad described above (Wilson, 2002 and Penrose
op.cit.). An examination of the definitions
of KM offered by consulting corporations in
the early 21st century shows that many of the
technological frameworks for data mining were
treated as instances of KM. It seems that during
the early days KM was treated as synonymous with DM
and information distribution in organisations. Both
activities are discussed here with a critical
eye to their reality content in terms of knowledge
validity. Given the lack of terminological rigour
as to what knowledge is and what the practices that
guide KM look like, and given the heavy emphasis on
IT, Wilson (2002) concludes that KM can be
described by a strong practical impetus.

A hermeneutical reading of Penrose suggests
a very different picture on a number of topics.
In the first instance it becomes apparent that the
social embedding of knowledge is recognised,
and that two people seldom share exactly the
same knowledge. This is due to the character of
knowledge in its tripartite understanding (cp. Luft,
1994; Tuomi, 1999; Lee & Yang, 2000; Alavi
and Leidner, 2001). It seems that proponents of
Knowledge Management work on the assumption,
originating in Penrose’s understanding, that there
is a set of objective knowledge available to all; a
view on which Wilson has nothing to say except
that it is a highly dubious conception. Furthermore,
this perspective on KM does not hold once it is
observed that other means are required to allow for
Knowledge Sharing.

Knowledge Sharing in its own right also falls
short in explaining how new knowledge comes into
the world. The reason is that it is rarely
taken into consideration which effects knowledge
differences, based on socialisation, have
on Knowledge Sharing (e.g. Pfeffer and Sutton,
2000; Hinds and Pfeffer, 2003; Huysman and de
Wit, 2003). Authors taking up the problem of
information understanding, defined as the
conversion of information into knowledge, argue
much more on the basis of information exchanges.
Exchanged information is treated in data mining as
input for the computation of knowledge (Witten and
Frank, 2004). This holds true in particular for
the works of Lichtenstein and Hunter (2005) and
Markus (2001). It seems that authors arguing in
the field of Knowledge Sharing employ the
data-information-knowledge hierarchy.

Organisational Knowledge, as another field
that has an enormous interest in IS, and thus in
data mining, shows some of the difficulties in a
much more pronounced way. In particular, it remains
undecided whether organisations have
knowledge of their own, or whether employees, as
representatives of the organisation, exchange their
knowledge, which is in turn taken as organisational
knowledge (e.g. Nelson and Winter, 1982; Jones,
1995; Robey, Boudreau and Rose, 2000; Stacey,
2001; Lehesvirta, 2004).

This admittedly cursory overview of KM
indicates that DM based knowledge generation has
problematic features (cp. Dennis et al., 1998 for
the example of knowledge conveyance capacities
of different media; Markus, 2001; Mulder, 2004).
Referring back to the definition of IS, it becomes
apparent that IS and knowledge are seemingly
dependent on each other in the light of data
mining. Moving a step further, it is argued that
data stored in the data marts or data warehouses of
organisations represent very different,
point-of-view dependent interpretations of
organisations themselves and their environment.

At this stage it becomes apparent that this
chapter contributes to the content of this Handbook
by developing a genealogy of the emergence of
DM based knowledge generation as a result of the
availability of different technologies, and of how
those are used to develop accounts of the reality
in which organisations (be they private or public)
act. Furthermore, this chapter attempts to show
where and whether classical organisational
theories (here Penrose, but Nelson and Winter,
1982 could also be taken) can be considered
possible ancestors of DM based KM, and thus of
knowledge generation and the assignment of validity
to it.


METHODS FOR KNOWLEDGE 
GENERATION FROM ‘OPERATIONS’ 


Under this heading an attempt is made to
understand why data mining has received so much
attention in the organisational sciences and,
equally important, why knowledge created by
non-quantitative methods is somewhat neglected.


The Rational-Empirical Legacy 


According to Toulmin (1990) and Nowotny et al.
(2004), science has for a very long time followed
the depersonalised and rational conception
of knowledge creation with an emphasis on the
general (Toulmin, 1990). The same holds true for
the organisational sciences, and even more
prominently there (e.g. Mintzberg 1978; and the
whole classical-economics-oriented school of
Principal-Agent theory and the related field
of Transaction Cost Economics, e.g. Eisenhardt
1989b; Holmstrom & Tirole 1989; Williamson
1991; Tsoukas & Cummings 1997).

Tsoukas and Cummings (1997) refer to this
development in, and of, organisational studies as
the desire of the management sciences to be treated
as a science based discipline. In a strictly
academic understanding of sciences, organisational
studies begin during the 19th century. Tsoukas and
Cummings suggest that the rather strict formalistic
approach toward knowledge and science was
selected to achieve recognition of the discipline
towards the end of the 19th century (cp. Turner,
2001, pp. 47-51 for a discussion of methodological
choices during the development of the academic
sector during the 19th century). It is suggested
that the field of organisational sciences begins
considering management as ‘management sciences’
only after the full implementation of rationalism
as the way of doing science in the wake of the
Industrial Revolutions.

Historically, this phase represents the
combination of technology and ‘management methods’
(Tsoukas & Cummings, 1997). Toulmin (1990)
argues that many of the main tenets of rational
knowledge (sciences) - written form, universal
scope and search for application, a-temporal
validity (Toulmin 1990, 30-35) - are reflections of
the attempt to minimise human fallacies. Formalism
is a means of overcoming and dealing with the
fallacies of human cognition and ambition as
exposed during the Thirty Years’ War (Toulmin 1990,
54-56).

Thus, a first hint at the supremacy of rational
arguments relates back to the beginning of the
18th century and the academic endeavour of looking
for eternal truth (cp. Tsoukas & Cummings 1997;
Foucault 2006a, 394-400). How does DM fit into
organisational sciences when it is based on the
particular data sets of the individual
organisation, a move that represents almost a
turnaround of the rational ethos? In the wider
discourse on rationality, information is referred
to as a resource minimising risk, and thus related
to the concept of security. This resembles
Penrose’s understanding of information as a source
for taking action in a more informed way by gaining
a better understanding of the consequences of
actions taken. Security, in line with Foucault
(2006 a/b), is one of the main tenets of the modern
way of thinking in liberal democratic states - but
for Foucault security takes a very different form,
which is elaborated below in more detail.

One of the main properties of knowledge
generated during the 18th and 19th centuries is its
foundation in an academic setting under conditions
of academic peer review. This review, and
the methods employed for knowledge generation,
established its validity. There were few or no
external locations that engaged in the production
of knowledge.

As a result of the industrial revolution, and
at the latest after World War II, Nowotny et al.
(2004) observe that knowledge generation moved out
of the traditional, state-sanctioned, predominant
field of academic research into the individual
(industrial) organisation (Nowotny et al. 2004, pp.
66-75 and pp. 80-84). In particular the
commercially oriented production of knowledge,
while adhering to procedures resting on
rational-methodological empirical methods, has led
to a re-orientation to the particular organisation
(ibid. pp. 88-90). Knowledge becomes particular,
and validity is assigned to the outcomes of
research by their fit to the purpose for which
solutions were sought (Nelson & Winter, 1982 have
labelled this process “search”). “Knowledge”
generation becomes temporal, oriented to the
particular, and “personalised” to the employee
working in a given process and conducting the
inquiry - and, mediated by this process, knowledge
becomes organisational (cp. Nelson & Winter, 1982,
247-250).

The overall move towards constructionist
foundations starts with Kuhn (1996) and
then reaches the management sciences of the late
20th and early 21st centuries (e.g. Von Krogh, Roos
& Slocum, 1994; Leonard-Barton, 1995; Nonaka &
Takeuchi, 1995). Thus, the discipline of
organisational sciences begins re-appropriating
qualitative methods through its interdisciplinary
character (for a discussion of different
disciplinary understandings, see Thompson Klein
1996; Krone 2007, 23-24). Simultaneously, internal
to organisations, the old ‘rationalistic-empirical’
mode of knowledge generation is favoured, in
particular in its mathematics and formulae oriented
form (e.g. Tsoukas & Cummings, 1997). So, where is
DM situated in this dialogue between
technologically inspired knowledge creation and the
quest for the general?


DM and the Particular: 
Methodology for the General 


Data mining is oriented to the individual
organisation’s knowledge creation. This does not
happen consciously. Rather, this ‘individualised’
form of knowledge creation is part of the
appropriation of IS in the course of their
implementation in the organisation (cp. Swanson,
1994; Krone, 2007). Due to the character of IS,
being based on ‘Best Practice’ from a wide array of
industries and many implementations, they have a
general appeal, with layout and organisational
model included (Kallinikos, 2004, 22). From a
practitioner’s perspective DM is seemingly oriented
to the creation of generally valid, a-temporal
knowledge. This dualism becomes visible when
recognising that ERP systems are founded on
software engineers’ idiosyncratic understanding of
humans and of the task at hand in the individual
organisation where ERP systems are implemented
(Isomäki 2002, 184-187; Kallinikos 2004, 9-11).
Based on Penrose’s idea that information minimises
reluctance in taking action, the rise of DM becomes
explainable in line with the traditional
self-understanding of economics and of the
organisational sciences as a closely related field
of study.

Economics as a discipline is about the most
efficient allocation of resources in a given
setting under constraints of scarcity. In line with
this understanding, DM’s property of allowing
the development of new modes of conducting business
makes it attractive for planning the deployment of
resources in operations (Luan, 2002 for the
academic field; Sartipi, Yarmand and Down, 2007 for
the field of Electronic Health Provision; Blackwell
and Ravada, w/o). With the aid of DM, organisations
develop and improve their internal procedures
by ‘technological’ task execution based on data
originating in OLTP. For scholars conducting
research in the field of organisational sciences,
the question to be asked is: why is there no outcry
about the return to the particular?

It shall suffice to suggest that insofar as
data mining, and the infrastructural applications
leading to the raw data used for machine learning,
is constructed with a rational impetus, it
reinforces a rational world perspective (Floyd,
1992; Tsoukas and Cummings 1997; Kallinikos 2004).
Results gained from the particular organisation are
perceived as generally valid, since the methods
used in DM are generally valid ones. Knowledge
creation becomes an activity isolated from the
general discourse of knowledge validity; or rather,
the methods assure validity as such (cp. Penrose,
op. cit.). Knowledge created in commercial settings
adheres to the same principles as that generated
for scientific purposes, so one could argue: it
is rational science with the general in focus, so
why bother (Nowotny et al., 2004, pp. 198-200
and p. 223). The remaining question to be asked
becomes: if the particular results from data mining
are not challenged in their validity, why is the
same status not granted to knowledge that is
created by non-quantitative methods?


UNCOVERING THE RATIONAL 
IN QUALITATIVE METHODS 


Following Toulmin (1990), Kvale (1995),
Czarniawska(-Joerges) (1998, 2001) and Sandberg
(2005), qualitative methods, and thus information
perceived as “weak”, are criticised for failing
to minimise the impact of human accounts of
organisational reality on the knowledge creation
process.


When later engaged in qualitative research I
encountered the positivist trinity [...] used by
mainstream researchers to disqualify qualitative
research. [...] “The results are not reliable, they
are produced by leading interview questions”;
“The results are not to be generalized, there are
too few interview subjects”; and “The interview
findings are not valid, how can you know if you find
out what the person really means?” (Kvale 2005).


Kvale shows that these weaknesses are endemic
to knowledge creation with qualitative methods
(Kvale, 1995). For him qualitative and
quantitatively oriented methods are not per se
right or wrong. They differ in their scope of
application and intentions of explanation.
Following Kuhn (1996), there is not so much
difference with respect to the ways in which
knowledge is generated, but rather in how relevant
others are convinced that a dedicated set of
knowledge is valid (understood as a description of
reality that is shared in a given social setting;
cp. Berger & Luckmann 1990; Barnes 1995; Stacey
2001).


Peculiarities of Qualitative Research 


For Kvale (1995) both quantitative and
qualitative forms of knowledge creation have their
subjective influences. Under conditions of
constructionist knowledge generation, exemplified
here by theories, knowledge sets are considered
valid if they recognise:


Validity as Quality of Craftsmanship 


The validity of research results arises from the
capacity of the researcher to remain within the
limits of his/her original research. This includes
a constant process of self-reflection by the
researcher. He/she asks him-/herself whether the
methods used are correctly applied, and whether
other methods might potentially be of benefit.
Researchers become part of their findings in that
these reflect the integrity of the researcher.
Knowledge claims generated by qualitative research
are valid if they expose the representativeness of
the object researched. The research process, and
the presentation of findings, is characterised by a
constant questioning of what should be presented as
the study’s content, and why (Kvale 1995).


Communicative Validity 


This criterion for validity involves the
requirement for dialogue between the researcher and
the social environment into which he/she is
integrated. Ideally, a discussion takes place about
the reality in which the dialogue partners live.
Valid knowledge is generated when knowledge claims
are weighed against each other and the author’s
results are considered trustworthy by peers. The
trustworthiness of knowledge is evaluated by asking
about the character of the discourse - the how of
the argumentation. When considering the criteria
under which the discourse is held and when
something is considered true - the why - Kvale
refers to elements such as “[..] connoisseurship
and criticism, accepting the personal, literary and
even poetic as valid sources of knowledge” (Kvale
1995). The dialogue itself - the who - is held
among researchers, but should include the wider
audience. Under conditions of postmodernity valid
knowledge is produced by an inter-subjective
validation of the content of knowledge claims.


Pragmatic Validity 


In line with the pragmatists’ view that knowledge
is there to obtain better action options, knowledge
validity is achieved by allowing for better
actions. The audience of the results of
researchers’ work assigns validity to a knowledge
claim. When the criterion of pragmatic validity is
applied, communicative validity becomes obsolete.
The pragmatic validity criteria include ethical
aspects of knowledge utilisation. Under these
conditions the how of the validity is established
by different measures, depending on the expected
outcome. Knowledge claims can be valid if they are
accompanied by actions taken in line with the
claim. In a second form, a knowledge claim is valid
when it guides the person applying the knowledge to
different action options. The cause of actions -
the why - is whether knowledge can lead to better,
and ethical, actions. Given the process of defining
validity, the why of the knowledge set leads to the
question of whether it allows for change. Validity
of knowledge in line with pragmatic criteria is
established by a dialogue between the researcher
and its users - the who of its establishment.


Positivistic Sciences and 
the Personal Element 


Reconsidering knowledge generation in
organisational sciences, Tsoukas and Cummings
(1997; similarly Toulmin, 1990 from a more general
perspective) show that in lockstep with the
emergence of rational sciences the art of
storytelling in personalised, human oriented ways
vanished. However, these forms of knowledge did not
vanish as such, but they became subject to strong
criticism for their ‘unscientific character’ in the
eyes of rationalists.

Following Kuhn, Barnes and other authors in
the constructionist ‘camp’, qualitative methods
imply and embody rules as strict as those of
quantitative methods. Often these qualitative methods
are used to mask personal opinions. Implicitly,
researchers thereby act less ethically with respect to
the research field at hand (Bridges 2003; Sandberg 2005).
Bridges (2003) shows that educational science
knowledge created by qualitative methods is not
inferior (as both fields are concerned with humans,
the author stipulates here an analogy between
organisational sciences and educational sciences).
Scientists employing these methods regularly tend
to work less ethically by not complying with
standards of scientific work, a point that was made
a long time ago by Ravetz (1996). He argues that
ethical issues increase with the growing investments
required in the field of (natural) sciences. The
reason is that research investments exceed individual
researchers’ budgets, opening the avenue to
industry-oriented research. This form of research,
according to Nowotny et al. (2003, 15-20), adheres to
the logic of commercialisation and demands rules
different from those of academic research (Ravetz 1996;
for the ethical conditions that lead to ‘good
research’, see pp. 37-44).

When the topic of (organisational) knowledge
in organisations is added to the discourse
on methodology, and the discussion is confined
to the field of qualitative or quantitative methods
themselves, the background description of data
mining should be recalled. The argument was
that knowledge generation by means of machine
learning is based on rules predefined by humans
(Witten & Frank, 2005, pp. 4-6, 30-35).
As the rules developed by machines are human-made
with respect to the search algorithms applied
and the data collected and captured
for DM, Witten and Frank argue that knowledge
sets resulting from machine learning have to be
treated with caution. The reason is that
they are based on language and search biases
(an ‘overfitting-avoidance bias’ also exists but is
less relevant for the paper at hand; Witten and
Frank, 2005, pp. 32-4).

The language bias takes its origin in the ques- 
tion whether natural language methods should be 
used in order to describe findings, and whether 
there are limits in what should be accepted as 
knowledge sets originating in the machine learning 
process. Included in this bias is the question how, 
and whether, results from machine learning can 
be expressed in a natural language, or not, due to 
knowledge deficits about the domain and language 
used in the domain (ibid. p. 32-33; Krone, 2007). 
Thus, as a side note, one comes back to the question
whether and how Expert Systems can produce
generally understandable and applicable solutions
for managerial decision-making.

Search bias refers to the problem that data
(knowledge sets) gathered via data mining are, in
a statistical sense, attempts to fit the collected data
to the problem at hand. Data are not assessed in
their breadth of potential applications and avail- 
able different readings, but in line with a human 
inspired maximum range of potential solutions. 
This process confines the heuristic of the machine 
learning processes. Potentially not all possible 
interpretations of the data are obtained. Problems 
like this are avoided when rechecking data after 
initial results were used. Manually reshaping 
results in order to prevent narrow results can 
happen. Search regimes (general to specific, or 
vice versa) can lead to skewed data that later have 
to be reconsidered (Witten & Frank, pp. 33-34). 

Considering data mining from this perspective,
not only is the understanding of the social environment
as embedded in Information Systems (cp. Floyd
1992 a/b; D’Adderio 2002; Isomäki 2002;
Kallinikos 2004) problematic, but so are the form
of knowledge generation and the principles the
knowledge output adheres to. The eminent question
is again: why use quantitative methods
that rest on individuals’ understanding of
a given field? Hence, as related to the aspect of
organisational learning, the question is: why use these
mechanised procedures for knowledge generation
in organisational settings?


TRENDS OF DM AND KNOWLEDGE: 
SECURITY AND ORGANISATIONS 


First, technical and organisational boundary
conditions for data mining were presented. The emphasis
was on developing an understanding that data
mining rests on different sources of data derived
from human-made Information Systems, and is
thus necessarily subjective in many respects. Data
mining can be defined as the process of using
data generated from different sources in order to
produce forward-looking predictions of causes of
events during plan execution in any organisation.

Knowledge is described from Penrose’s (1995)
perspective as principally shared in a given
community (objective knowledge), but as giving rise
to different fields of application owing to work
experience (experience). Knowledge becomes a
resource rendering different services to different
people. Information represents a means for taking
decisions under conditions of uncertainty, by
collecting new or additional information. The load of
information gathered by organisational staff will
vary, but whatever the amount, it will suffice to
overcome uncertainty when it is collected by means
that the organisation defines as methodologically
sound.

On the element of methodological diversity, the
argument is made that qualitative methods are not
necessarily inferior to quantitative ones. Science,
and scientific knowledge creation, can deal with
different approaches to knowledge generation.
Problems of knowledge generation, and thus
validity, are not inherent to methods, but rather
to the process and intentions when using a given
meta-methodological choice. With Nowotny et
al. (2004) and Ravetz (1996) the two extremes
of this continuum are presented. The common thread
of these authors is to show that humans conduct
research, and that deficits in the utilisation
of both methodological stances therefore relate back
to the scientific practices of the host setting in which
they are embedded, that is, the dividing line between
‘science’ and commercially oriented ‘applied science’
(Ravetz 1996).

What can then explain the strong drive in more
and more organisations to create knowledge by
means of data mining? Going a step back to
the understanding of information as minimising
insecurity in taking action for organisations, an
important term emerges, namely security.


Economising and 
Normalising for Security 


Security, understood with Penrose, is about the
anticipation of the consequences of a given
course of action (Penrose, 1995, pp. 58-60). A
different understanding of security is available in
the social sciences and has only recently re-emerged
as important. It is the understanding developed
by Michel Foucault.

In his account, security as an action-guiding
category coincides with the emergence of the modern
rational science discourse (cp. Foucault 2006a/b;
cp. Toulmin 1990). It is necessary to trace some
elements of security in Foucault’s understanding,
as this allows answering the second research
question. Toulmin (1990) makes the compelling
argument that rationality and security have to be
thought together when describing the view of Descartes
and other rationalists that the firm principles of
rationality are expressions of nature-like laws that
should be discovered by scientific means (Toulmin,
1990, pp. 129-131; Foucault, 1996 a, pp. 428-431).
In this view, nature-like laws are expressions
of strict and strong hierarchies that hold true for
societies (Toulmin, pp. 132-135).
This argument nicely matches the
institutionalisation of the “rationalisation” of life as a
way to comply with the emergent market
order that swept away the old medieval feudal
form of government (Foucault 2006a, 332-339).
This kind of life goes beyond the individual market
participant (cp. Foucault, 2006 a, pp. 93-96;
Foucault, 2006 b, pp. 390-2; Dillon, 2008, 317).
For Foucault the market, and the way it was
becoming ubiquitous in the western world, is a
mechanism by which the state attempts to secure
its own society on the one side, and on the other
delimits its own action options in order to ensure
security (Foucault 2006a, 105-108, 394-402; cp.
Dillon 2008, 310). Dillon (2008) extrapolates
Foucault’s idea and suggests that nowadays life
has become unsecurable against the contingencies
of its own development.


[..B]iopolitically speaking, contingency is consti-
tutive of what it is to be a living thing, the referent 
of object of biopolitics — life — cannot be secured 
against contingency. Biopolitically, it is instead 
secured through contingency (Dillon, 2008, 310). 


Life is secured by gambling on contingencies 
of events that may occur at some point in time. 
By this means life is virtualised as a variable in 
the overall calculation of probabilities that form 
the motor of today’s derivative oriented financial 
economics (Dillon, 2008, 311, 326-329). Given
that these derivative-oriented financial economics
are based on data mining and on the development
of rules of statistics (Foucault, 2006 a, 90-8, in
particular pp. 95-98), it is time to rethink security
and data mining as interdependent objects.

According to Foucault (2006), the success of
the statistical method relies on the definition of
objects (cities, citizens, states) as abstracted,
‘normalised’, neutral objects circulating in a given
territory. Similarly, Witten and Frank (2005) argue
to some extent that the results of data mining are
matched, like statistical results, against a given
description of reality. For Foucault, by initially
applying the statistical method, research was
concentrated on those measures that allow minimising
deviations from the declared normalised
status (Foucault, 1996 a, p. 96-97). Thus, the
re-interpretation of data mining in line with such an
understanding of security, and further the creation
of knowledge resting on normalised data from
OLAP, can help in understanding the emphasis
given to machine-learned knowledge sets.


A Security Oriented Genealogy 
of Organisations 


When surveying the literature on organisational
design and management (cp. Mintzberg, 1978;
Eisenhardt 1989a; Ghoshal & Moran, 1996;
Gibson et al. 2003), it is no exaggeration to suggest that
much emphasis is placed on normalising
operations and procedures and on maintaining an
optimised flow-through of products or service
provision. Insofar as the focus is on the optimised
flow-through (the double-edged perspective of this
word when remembering Foucault is interesting),
there is a strong impetus to cut down on variability
in task execution on the side of employees
(March & Simon 1958, 29; Mintzberg 1978;
Gibson et al. 357-362). Adler and Borys (1996), Sia
et al. (2002) and Kallinikos (2004) argue that
ERP and MIS in particular are means of dramatically
cutting down on employees’ autonomy in task
execution, even if declared otherwise.
Furthermore, the principal-agent model of
management science can be used to show that in the
management sciences there is a constant drive
to detect and apply models that minimise the
risks of individual autarkic behaviour by enframing
it ‘technologically’ (e.g. Eisenhardt 1989b;
Holmström & Tirole, 1989). If this admittedly
short overview is considered valid, the role
of IS in organisations can also be read in the light
of security in Foucauldian terms, which answers
the second research question.


CONCLUSION: DM AND
SECURITY OR PREDICTABILITY 
AND IGNORANCE 


Extending the argument of the previous sub-chapter,
IS are means that ensure organisational security
by allowing for predictability through cutting down on
the variability of actions carried out by employees.
Data mining is then the automated process by
which information from the IS is converted into
knowledge that is taken to be valid “per se”.

Qualitative, human-oriented modes of
knowledge creation are suspected of fallacies and
subjectivities, in line with the rational emphasis
of organisational sciences that goes hand in hand
with security concerns (Foucault 2006a/b; Floyd
1992a/b; Capurro 2008 for the critical analysis of
dehumanised and unreflective IS design; Tsoukas
& Cummings, 1997 in organisational sciences).
In more abstract terms, this leads to a process in
which the human element is squashed in order to
allow for smooth operations. An interesting aspect
of this de-humanisation process of IS in
organisations is its emphasis and strictness. Thus, on
the one hand the market element (customer
focus in organisations for better action options,
the Penrose’an interpretation of information) is
favoured, while on the other hand results of data
mining are not questioned (Witten & Frank 2005,
35-37; Floyd 1992 a with her call for a
reflective-discursive approach toward IS).

Adding a further step to this sceptical account
of the autonomy of data mining based knowledge,
Capurro (2008) stresses the ethical problems of
IT more generally. One part of these ethics is


[..T]o learn not just to store, retrieve, and man- 
age information but to become aware that what 
we do is handle with biased knowledge, i.e. that 
our basic ability in an information society should 
be a hermeneutical one, which includes such 
critical arts as the interpretation, aesthetic or 
creative design, and responsibility towards our 
lives (Capurro, 2008). 


Consequently, the ‘self’, understood as a me- 
diator between the “ego” and the environment, 
has to be strengthened and recaptured against and 
with IT. Taking a step back, it becomes apparent 
that the exclusion of qualitative methods within 
the field of organisational sciences leads to an 
extended securitisation and abstraction from the 
human element. By these means, consumers of 
data mining based knowledge become subject to 
self-deception in respect to the validity and neutral 
character of knowledge generated. Reason is that 
ERP, MIS, and their respective data pools, are me- 
diated representations of designers’ understanding 
of the world we are living in. 

The challenge is that even knowledge generated
from machine learning has a human core, mediated
by the IS design and the DM in particular, of
which users should be aware. In data mining, the
level of normalisation of lives is highest, and the
alleged neutrality of knowledge sets is most
taken for granted. As Witten and Frank (2005) have
shown, even if data warehouses and machine learning
are the means by which knowledge is gathered, some
inferential and/or deductive means are still to be
borne by humans: humans define the universe of
inquiry about which they want to learn something.


KEY TERMS AND DEFINITIONS


Data Mining: Process of collecting and ana- 
lysing data originating from IS for underlying 
structures that can inform about patterns invis- 
ible to human perception; predominantly applied 
against abstracted data 

Information Systems: Combination of tech- 
nological and human systems in order to facilitate 
for the exchange of information; different forms 
exist, e.g. DSS, ERP, MIS 

Knowledge Creation & Methodology: Process
by which a given sequence of procedures
is taken in order to interpret data or information
for the creation of new insight

Organisation: Human generated and enacted
social object, serving a given purpose for an
undefined amount of time with an internally defined
structure

Qualitative Methods: A given set of procedures
for the generation of knowledge attempting
to minimise human impact on knowledge outcomes;
formally oriented, uses abstractions


Security: Taking actions in an informed way 
and being aware of the consequences these ac- 
tions will have. 

Validity: An outcome of knowledge opera- 
tions representing reality in an uncontested way 
when humans using different knowledge creation 
methodologies engage in dialogue 


Section 3 
Data Mining in Organizational 
Situations to Prepare 
and Forecast 


ABSTRACT


In this study, two data mining based models are proposed for crude oil price analysis and forecasting,
one of which is a hybrid wavelet decomposition and support vector machine (SVM) model and the other
is an OECD petroleum inventory levels based wavelet neural network (WNN) model. These models utilize
support vector regression (SVR) and artificial neural network (ANN) techniques for crude oil prediction
and are each compared with other forecasting models. Empirical results show that the
proposed nonlinear models can improve the performance of oil price forecasting. The findings of this
research are useful for private organizations and governmental agencies to take either preventive or
corrective actions to reduce the impact of large fluctuations in crude oil markets, and demonstrate that
the implications of data mining in public and private sectors and government agencies are promising
for analyzing and predicting on the basis of data.


INTRODUCTION


The need for both private organizations and
government agencies to utilize data in public
and private sector activities has been increasing in
recent years, including collecting and managing
data and analyzing and predicting on the basis
of data. For example, a manager in the energy
sector, in order to make the right decisions, must
know the expected international crude oil
price and energy supply and demand. At the same
time, the manager should know the main factors
affecting oil price fluctuation, demand and supply
capacity. Both the strategic and operational decisions
of an organization or government agency require the
exploration of the current relationships among all
the factors and the construction of forecast models.

As a source of energy and chemical raw materials,
crude oil plays an important role in the
development of the world economy. In recent years,
the fluctuation of crude oil prices has become larger
and larger, which not only directly affects global
economic activities, but also brings risk to
oil-related enterprises and investors. The crude oil
price is emerging as one of the hottest topics in the
world. Influenced by many complicated factors,
however, oil prices appear highly nonlinear and
even chaotic (Panas and Ninni, 2000; Adrangi
et al., 2001), which makes it rather difficult to
forecast future oil prices.

Although oil price forecasting is very difficult,
it has fascinated many academic researchers and
business practitioners in the past few decades.
There is a substantial literature on the analysis
and forecasting of crude oil prices, including
qualitative and quantitative methods, on the basis of which
many decisions with regard to oil prices have to
be made (Fan et al. 2008). Among the qualitative
methods, Nelson et al. (1994) used the Delphi
method to predict oil prices for the California
Energy Commission. Abramson and Finizza (1991)
used belief networks, a class of knowledge-based
models, to forecast crude oil prices.


Besides these qualitative methods, a large
number of quantitative methods and models have been
developed to analyze and forecast crude oil prices.
According to Zhang et al. (2008), the quantitative 
methods can be grouped into two categories: struc- 
ture models and data-driven methods. Standard 
structure models outline the world oil market and 
then analyze the oil price volatility in terms of a 
supply-demand equilibrium schedule (Zhang et 
al. 2008). For example, Bacon (1991) discussed 
the factors determining the demand of oil, the 
supply of oil by OPEC and non-OPEC countries, 
and gave the forecast of crude oil prices. Al Faris 
(1991) analyzed the determinants of crude oil 
price adjustment in the world petroleum market. 
Data-driven methods include various models 
and approaches, such as traditional time series 
methods, econometric models and data mining 
techniques. 

There are abundant studies on crude oil price
prediction using time series and econometric
methods. Huntington (1994) applied a sophisticated
econometric model to predict crude oil prices in
the 1980s. Abramson and Finizza (1995) utilized a
probabilistic model for predicting oil prices. Gulen
(1998) used co-integration analysis to predict the
West Texas Intermediate (WTI) price.
Barone-Adesi et al. (1998) suggested a semi-parametric
approach for oil price prediction. Similarly,
Morana (2001) offered a semi-parametric method
for short-term oil price forecasting based on the
GARCH properties of the crude oil price. In more
recent studies, Ye et al. (2002, 2005 and 2006)
proposed short-term forecasting models of monthly
WTI crude oil spot prices using OECD petroleum
inventory levels. Lanza et al. (2005)
investigated crude oil and oil products’ prices using
error correction models (ECM). Sadorsky (2006)
used several different univariate and multivariate
models such as TGARCH and GARCH to estimate
forecasts of daily volatility in petroleum futures
price returns.

As mentioned in Yu et al. (2008), the traditional
time series and econometric models can provide
good prediction results when the time series is
linear or near linear. However, a great deal of
nonlinearity and irregularity exists in crude oil
price series, and numerous experiments have
demonstrated the poor performance of traditional
statistical and econometric models. Hence, the
exploration of forecasting models based on data
mining techniques has attracted much attention from
researchers. Recent work in a number of studies
(Kaboudan, 2001; Mirmirani and Li, 2004; Xie
et al. 2006; Shambora and Rossiter, 2007; Yu et
al. 2007) has shown that data mining techniques
may provide potential solutions to crude oil price
prediction. Kaboudan (2001) employed GP and
ANN to forecast crude oil price. Similarly,
Mirmirani and Li (2004) offered an ANN model with
a genetic algorithm (GA) to predict crude oil price
and compared the results with the VAR model. Xie
et al. (2006) presented a support vector regression
(SVR) model to predict crude oil price. Shambora
and Rossiter (2007) suggested an ANN model to
predict crude oil price, and Yu et al. (2007) also
used an ANN ensemble model to predict crude
oil price. Meanwhile, some hybrid methods using
data mining have been used to predict crude oil
price and have obtained satisfactory performance. For
example, Wang et al. (2004) developed a hybrid
approach by means of a systematic integration
of ANN and a rule-based expert system, with web
text mining, to predict crude oil price. Wang et al.
(2005) proposed a TEI@I methodology for crude
oil price forecasting and obtained good prediction
performance.

Although some data mining techniques including
ANN (e.g., Shambora and Rossiter 2007)
and SVM (e.g., Xie et al. 2006) have been used
to forecast crude oil prices, there are still some
difficulties in data mining forecasting. First of
all, most forecasting models have failed to produce
consistently good results due to the nonlinear
mechanism and intrinsic complexity of the crude oil
market. In the past, the crude oil price was usually
treated as a single series, so the intrinsic complex
modes involved in the price series were mixed and
could not be explored in depth. Secondly, as petroleum
inventory levels provide a good market barometer
of crude oil price fluctuation in the short run, the
relationship between petroleum inventory levels
and crude oil prices has been studied by many
researchers and found to be nonlinear (Ye et
al. 2002). However, the nonlinear relationships
between inventory and crude oil prices may be
more complicated than that suggested by Ye et
al. (2002). Moreover, it is also known that some
major geopolitical events, like 9-11 in the USA, affect
crude oil prices significantly, with variable lengths
of influence or impulse functions.

Based upon the above two aspects, it is necessary
to introduce new data mining forecasting
models for crude oil prediction. For this reason,
this chapter will formulate two data mining
forecasting models in an attempt to overcome the
two main difficulties mentioned above. One is a 
hybrid wavelet decomposition and SVM model 
and the other is an OECD petroleum inventory 
levels based wavelet neural network model. These 
models utilize support vector regression (SVR) 
and artificial neural network (ANN) technique 
for crude oil prediction. The main objectives of 
this chapter are as follows: (1) to show how to 
construct the forecasting models using data min- 
ing technique; and (2) to display how to predict 
crude oil prices using the proposed models. In 
view of the two objectives, this chapter mainly 
describes the building process of two data mining 
forecasting approaches and the application of these 
forecasting methods in crude oil price prediction, 
while comparing the forecasting performance with 
different evaluation criteria. 

In this chapter, we highlight data mining
techniques for crude oil price prediction. The
rest of this chapter is organized as follows. Next
we describe data mining techniques for crude
oil prediction. The building process of monthly
crude oil price forecasting models using data
mining methods is then proposed. After that,
experimental analysis and comparison are given
in detail. Finally, some concluding remarks and
future work are presented.


DATA MINING METHODS 
AS A CRUDE OIL PRICE 
FORECASTING TOOL 


With the growing complexity of organizations and
governments, it is more and more important
to pick out relevant or evident information
for organizational and governmental purposes. To
support corporate decisions, we should create
systems and procedures to explore scenarios based
on quantitative and/or qualitative information.
For crude oil price forecasting, the traditional
time series models and econometric models
cannot satisfy practical needs. The main reason
is that these models are based on linear
assumptions, but oil prices appear highly nonlinear and
even chaotic. Hence exploring more efficient methods
and models for crude oil price forecasting is
emerging as an important problem. Data
mining techniques provide an immediate alternative
for constructing a reasonable oil price forecasting
model, which can analyze and forecast crude oil
prices efficiently and accurately by capturing the
nonlinear patterns hidden in the crude oil price
series. Data mining also plays an important role
in improving forecast accuracy.

Data mining is the exploration and analysis of large
quantities of data in order to discover implicit,
previously unknown, and potentially useful
information (Berry and Linoff, 2004). The main idea
is to build computer programs that sift through
databases automatically, seeking regularities or
patterns. Strong patterns, if found, will likely
generalize to make accurate predictions on future
data (Witten and Frank, 1999). Data mining is a
multidisciplinary field drawing on work from
statistics, database technology, artificial intelligence,
pattern recognition, machine learning, and information
and data visualization, and it has been used to
implement the tasks of classification, estimation,
clustering, profiling and prediction.

Among the numerous data mining techniques,
ANN and SVM have been widely used in the
field of forecasting. ANN is often regarded as a
class of reliable and cost-effective methods for
crude oil price prediction. The neural network
model can be trained to approximate any smooth
and measurable nonlinear function without prior
assumptions on the original data (Yu et al., 2007b);
it has produced many promising results in the field
of crude oil price prediction (Kaboudan, 2001;
Mirmirani and Li, 2004; Wang et al., 2004, 2005;
Shambora and Rossiter, 2007; Yu et al., 2007a,
2008). These studies show that ANN models are
very effective in simulating and describing the
dynamics of non-stationary time series due to their
unique non-parametric, noise-tolerant and highly
adaptive characteristics.

However, the inherent drawbacks of ANN
models, e.g., local minima, over-fitting, poor
generalization performance and the difficulty of
determining appropriate network architectures,
hinder practical applications of ANN models.
Support vector machine (SVM), first proposed
by Vapnik (1995), provides a class of competitive
learning algorithms that improve on the generalization
performance of neural networks and achieve
global optimum solutions at the same time. SVM
is a very specific type of learning algorithm
characterized by the capacity control of the decision
function, the use of kernel functions, and the sparsity
of the solution (Vapnik, 1995, 1999; Cristianini and
Taylor, 2000). Established on the structural risk
minimization (SRM) principle, which estimates a
function by minimizing an upper bound of the
generalization error, SVM is resistant to the
over-fitting problem and can model nonlinear relations
in an efficient and stable way. This property leads
to better generalization than conventional methods.
Furthermore, SVM is trained as a convex optimization
problem, resulting in a global solution that in many
cases is unique.

Figure 1. A procedure of ANN/SVM-based time series forecasting
[Flow chart of Phases I to IV. Box labels: Data preparation; Data Preprocessing; Determination of ANN/SVM architecture; ANN/SVM Training; Simulation & Prediction; Out-of-Sample Forecasting]
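
The chapter does not write out the optimization problem at this point. As a hedged illustration of the SRM idea and of the convexity mentioned above, the standard ε-insensitive support vector regression problem can be stated as follows. The symbols $w$, $b$, $\phi$, $C$, $\varepsilon$, $\xi_i$, $\xi_i^*$, $\alpha_i$, $\alpha_i^*$ and $K$ are the usual textbook notation and are assumptions of this sketch, not notation taken from the chapter.

```latex
% Standard epsilon-insensitive SVR (textbook form; notation assumed, not the chapter's)
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*}\quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\bigl(\xi_i + \xi_i^{*}\bigr) \\
\text{subject to}\quad & y_i - \langle w, \phi(x_i)\rangle - b \le \varepsilon + \xi_i, \\
& \langle w, \phi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i \ge 0,\quad \xi_i^{*} \ge 0,\qquad i = 1,\dots,n .
\end{aligned}
% The resulting regression function, expressed through a kernel K:
f(x) = \sum_{i=1}^{n}\bigl(\alpha_i - \alpha_i^{*}\bigr)\,K(x_i, x) + b
```

The term $\frac{1}{2}\lVert w\rVert^{2}$ controls model capacity, while $C$ weighs training errors larger than $\varepsilon$. Because the objective is convex and the constraints are linear, any local minimum is also global, and only the training points with nonzero $\alpha_i - \alpha_i^{*}$ (the support vectors) enter $f(x)$, which is the sparsity referred to in the text.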

When time series prediction is conducted by
ANNs or SVMs, the input vector {x} to the ANN/SVM
is a finite set of consecutive measurements of the
series, $x = (x(t), x(t-1), \ldots, x(t-s))$, with time delays,
which is a sliding window for the input vector. The
output of the model is $x(t+h)$, where $h$ is the
prediction horizon, a user-specified parameter.
The procedure of developing an ANN/SVM-based
time series prediction is illustrated in Figure 1.
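
The sliding-window construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `make_windows` and the toy price series are hypothetical, and each window is ordered newest value first to match (x(t), x(t-1), ..., x(t-s)).

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], s: int, h: int
) -> Tuple[List[List[float]], List[float]]:
    """Build (input, target) pairs for time series prediction.

    Each input is the window (x(t), x(t-1), ..., x(t-s)), newest value first,
    and the matching target is x(t+h), where h is the prediction horizon.
    """
    if s < 0 or h < 1:
        raise ValueError("s must be >= 0 and h must be >= 1")
    inputs: List[List[float]] = []
    targets: List[float] = []
    # t covers every index that has s past values and a value h steps ahead.
    for t in range(s, len(series) - h):
        inputs.append([series[t - k] for k in range(s + 1)])
        targets.append(series[t + h])
    return inputs, targets


# Toy series standing in for monthly prices (made-up numbers).
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
X, y = make_windows(prices, s=2, h=1)
```

With `s=2` and `h=1`, each row of `X` holds three consecutive values and the corresponding entry of `y` is the value one step later. These pairs are what an ANN or SVM regressor would be trained on.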

As can be seen from Figure 1, the procedure
of ANN/SVM-based time series prediction
can be divided into four phases, briefly
described as follows:

Phase I: Data Sampling. In situations where 
there are vast volumes of data to sift through, 
a process called data sampling can help mini- 
mize data processing and significantly reduce 
computational costs. Data sampling is a process 
whereby a statistically representative portion of 
the information is examined to determine if it 
contains responsive data. Using data sampling 
can help narrow the research focus, for example, 
by determining whether there are time periods in 
which relevant events do not exist; this makes it 
unnecessary to process or review that particular 
part of the data set. To develop an SVM-based
forecasting model, data should be collected from
various sources and then selected according to
specific criteria.

For crude oil prices, a variety of data can be
used. West Texas Intermediate (WTI) and Brent
crude oil prices are the two main crude oil price
benchmarks. In terms of data type, spot prices and
futures prices are available. In terms of data
frequency, daily, weekly, monthly, quarterly, and
yearly data can be used. The main purpose of data
sampling is to select representative data for
further processing and analysis.

Phase II: Data Preprocessing. After data
sampling, the next task is data preprocessing. It
includes two steps: data normalization and data
division. In any model development process,
familiarity with the available data is of the utmost
importance. ANN models and SVM models are
no exception. Data normalization can have a
significant effect on the models' performance. After
that, the normalized data should be divided into two
subsets: in-sample data and out-of-sample data,
to be used for model estimation and for model
evaluation and verification, respectively.
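
The two preprocessing steps can be illustrated as follows. The chapter does not say which normalization it uses, so min-max scaling is an assumption here, as are the helper names and the toy prices. The sketch also fits the scaling range on the in-sample part only, a design choice made here so that no out-of-sample information enters the model.

```python
def split_in_out(values, n_in):
    """Chronological split: first n_in points in-sample, the rest out-of-sample."""
    return values[:n_in], values[n_in:]


def min_max_fit(values):
    """Return the (min, max) used for scaling (assumed normalization)."""
    return min(values), max(values)


def min_max_apply(values, lo, hi):
    """Scale values so that lo maps to 0 and hi maps to 1."""
    span = hi - lo
    return [(v - lo) / span for v in values]


# Toy prices (illustrative numbers only).
prices = [10.0, 20.0, 30.0, 40.0, 50.0]
in_sample, out_sample = split_in_out(prices, n_in=3)
lo, hi = min_max_fit(in_sample)
in_scaled = min_max_apply(in_sample, lo, hi)
out_scaled = min_max_apply(out_sample, lo, hi)
```

Out-of-sample values may fall outside [0, 1] under this choice, which is expected when later prices exceed the in-sample range.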

Phase III: ANN/SVM Training. After the
data are preprocessed, ANN/SVM training can be
performed on the processed data. This phase has
three main tasks: determination of the ANN
structure or SVM input vector, sample learning, and
model validation. For ANN models, the architecture
and the training parameters (for example, learning
rate and momentum) must be decided first. There are
no general criteria for choosing these parameters, so
they are usually set by trial and error. All weights
are then initialized randomly. Training of the ANN
model stops when either the maximum number of
iterations is reached or the total sum of squared
errors falls below a predetermined value. For SVM
models, the input vector is usually determined by
choosing the time delay s by trial and error. In
sample learning, the regularization constant C, a
suitable kernel function K(.), and the associated
kernel parameters must be determined. They too are
often set by trial and error because there are no
universal criteria for choosing them. As an
alternative, search-based methods, such as grid
search and direct search, can be used to determine
the ANN or SVM parameters. After training, model
validation must be performed to check the
generalizability of the models. After validation, an
ANN or SVM predictor with the selected parameters
is obtained.
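
A grid search over candidate parameters can be sketched as below. This is only an illustration of the idea: `grid_search` and `toy_validation_error` are hypothetical names, and the toy error function stands in for the validation error that a real ANN or SVM training run would return.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and keep the one with the lowest score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_error(params):
    """Stand-in for a validation error; smallest at C=10, gamma=0.1."""
    return (params["C"] - 10.0) ** 2 + (params["gamma"] - 0.1) ** 2


grid = {"C": [1.0, 10.0, 100.0], "gamma": [0.01, 0.1, 1.0]}
best, err = grid_search(grid, toy_validation_error)
```

In practice `score_fn` would train the model on the in-sample data with the given parameters and return its error on a held-out validation part.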

Phase IV: Out-of-Sample Forecasting. The
trained ANN/SVM predictor with the selected
parameters is then used for out-of-sample time
series prediction.


BUILDING PROCESS OF 
MONTHLY CRUDE OIL PRICE 
FORECASTING MODELS 


Model A: A Hybrid Wavelet 
Decomposition and LSSVM Model 
for Crude Oil Price Forecasting 


In this section, a hybrid model (W-LSSVM) for
crude oil price forecasting is proposed by integrating
the wavelet transform and LSSVM. The proposed
model is built in three stages. In the first stage, the
original crude oil price series is decomposed into
several sub-series (an approximation series and
several detail series) by the Haar à trous wavelet
transform, each of which makes a distinct
contribution to the original series. In the second
stage, each sub-series is predicted with LSSVM
individually. In the final stage, the crude oil price
forecast is obtained by reconstructing the sub-series'
forecasts.


Wavelet Decomposition of 
Original Time Series 


In wavelet decomposition, the redundant Haar à
trous algorithm, an undecimated wavelet
transform, captures more complete characteristics
of the analyzed series and thus produces more
precise information for localization (Shensa,
1992). The non-decimated Haar algorithm with
the low-pass filter \(h = (\tfrac{1}{2}, \tfrac{1}{2})\) provides a convincing
solution to troublesome time series boundary effects
at the time point \(t\). Since \(h = (\tfrac{1}{2}, \tfrac{1}{2})\) is
non-symmetric, the calculation of scaling coefficients
and wavelet coefficients at time \(t\) uses only
information available up to time \(t\). This is a very
desirable feature in time series prediction.

The non-decimated Haar wavelet is presented
as follows:

First, the original time series \(C_0\) is decomposed
into an approximation component \(C_1\) and an
accompanying detail component \(W_1\). \(C_1\) may be
created from \(C_0\) by convolving the latter with
\(h = (\tfrac{1}{2}, \tfrac{1}{2})\):

\[ C_1(k) = \frac{1}{2}\left[ C_0(k) + C_0(k-1) \right] \qquad (1) \]

Wavelet coefficients \(W_1\) are obtained from the
difference between the approximation coefficients \(C_0\)
and \(C_1\), which can capture small features in the data:

\[ W_1(k) = C_0(k) - C_1(k) \qquad (2) \]

The decomposition process is then iterated with
successive approximations \(C_j\) (\(j = 1, 2, \ldots, N-1\))
being decomposed in turn, until the approximation
\(C_N\) is smooth, where \(N\) is the decomposition level:

\[ C_{j+1}(k) = \frac{1}{2}\left[ C_j(k) + C_j(k - 2^j) \right] \qquad (3) \]

\[ W_{j+1}(k) = C_j(k) - C_{j+1}(k) \qquad (4) \]

Thus, the original series is decomposed into
\(N\) detail series \(W_j\), \(j = 1, 2, \ldots, N\), and an
approximation series \(C_N\). The original time series
\(P(t) = C_0(t)\) can be reconstructed by summing up all the
decomposed sub-series on multiple scales.
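
The recursion in Eqs. (3) and (4) can be sketched directly. The function name and toy series are assumptions made for illustration. The text does not state how the first \(2^j\) points are treated, so this sketch repeats the first value when the index \(k - 2^j\) falls before the start of the series; that boundary rule is an assumption, not the authors' stated choice.

```python
def haar_a_trous(series, levels):
    """Undecimated Haar (a trous) decomposition.

    Returns (approximation C_N, [W_1, ..., W_N]).
    """
    approx = list(series)  # C_0
    details = []
    for j in range(levels):
        shift = 2 ** j
        # C_{j+1}(k) = 0.5 * (C_j(k) + C_j(k - 2^j)); index clamped at 0 (assumption).
        nxt = [0.5 * (approx[k] + approx[max(k - shift, 0)])
               for k in range(len(approx))]
        # W_{j+1}(k) = C_j(k) - C_{j+1}(k)
        details.append([c - n for c, n in zip(approx, nxt)])
        approx = nxt
    return approx, details


# Toy series (illustrative numbers only).
prices = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
c_n, w = haar_a_trous(prices, levels=2)
# Additive reconstruction: C_0(t) = C_N(t) + sum_j W_j(t)
rebuilt = [c_n[k] + sum(level[k] for level in w) for k in range(len(prices))]
```

Because each detail series is the difference of two successive approximations, the sum telescopes and the original series is recovered whatever boundary rule is used.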


Sub-Series Forecasting with LSSVM 


Suppose \(x_t\), the value of a time series at time \(t\), is
related to its historical data, and the relation can
be expressed by an unknown function \(f(\cdot)\). The
time series is converted into a state-vector
representation \(\{(X_t, y_t),\ X_t \in \mathbb{R}^k,\ y_t \in \mathbb{R}\}\), using the
embedding dimension \(k\) (Cao, 1997), where

\[ X_t = (x_{t-1}, x_{t-2}, \ldots, x_{t-k}), \qquad y_t = x_t \]

Suykens et al. presented the LSSVM approach,
in which the following function is used to
approximate the unknown function \(f(X)\):

\[ f(X) = w^T \varphi(X) + b \qquad (5) \]

where \(\varphi\) is a nonlinear function which maps the
input space into a higher-dimensional feature space.

Given training data \(\{(X_i, y_i)\}\), \(i = 1, 2, \ldots, N\), LSSVM
defines an optimization problem as follows:

\[ \min J(w, e) = \frac{1}{2} w^T w + \frac{1}{2}\gamma \sum_{i=1}^{N} e_i^2, \qquad (\gamma > 0) \qquad (6) \]

subject to the equality constraints

\[ y_i = w^T \varphi(X_i) + b + e_i, \qquad i = 1, 2, \ldots, N \qquad (7) \]

Having solved this optimization problem, the
resulting LSSVM model for regression can be
expressed as follows:

\[ f(X) = \sum_{i=1}^{N} \alpha_i K(X, X_i) + b \qquad (8) \]

\[ K(X, X_i) = \varphi(X)^T \varphi(X_i) \qquad (9) \]

where \(K(X, X_i)\) is the kernel function,
which can be any symmetric function satisfying
Mercer's condition.

Each sub-series is predicted with an LSSVM
model individually. Considering the strong
nonlinearity, the Radial Basis Function kernel is selected
as the kernel function for LSSVM in this chapter.
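
For readers who want to see how an LSSVM regressor is obtained in practice, the sketch below solves the standard LSSVM linear system, in which the bias and the multipliers satisfy \(\sum_i \alpha_i = 0\) and \((K + I/\gamma)\alpha + b\mathbf{1} = y\). It is a minimal pure-Python illustration under assumptions: the function names, the kernel width `sigma`, the Gaussian-elimination helper and the toy data are all introduced here and are not the chapter's code.

```python
import math


def rbf_kernel(u, v, sigma):
    """Radial basis function kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [A[i][:] + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lssvm_fit(X, y, gamma, sigma):
    """Return (b, alphas) from the LSSVM dual linear system."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0]
        for j in range(n):
            k_ij = rbf_kernel(X[i], X[j], sigma)
            row.append(k_ij + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    sol = solve_linear(A, [0.0] + list(y))
    return sol[0], sol[1:]


def lssvm_predict(x_new, X, b, alphas, sigma):
    """Evaluate f(x) = sum_i alpha_i K(x, X_i) + b, as in Eq. (8)."""
    return b + sum(a * rbf_kernel(x_new, xi, sigma) for a, xi in zip(alphas, X))


# Toy data (illustrative numbers only).
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 4.0, 9.0]
b, alphas = lssvm_fit(X_train, y_train, gamma=1e6, sigma=1.0)
fitted = [lssvm_predict(x, X_train, b, alphas, 1.0) for x in X_train]
```

With a large `gamma` the fit follows the training targets closely; smaller values trade training accuracy for smoothness.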


Time Series Forecasting 
Reconstruction 


The predicted values for each detail series are
expressed as \(\hat{W}_j(t)\), \(j = 1, 2, \ldots, N\), and the
predicted values for the approximation series are
expressed as \(\hat{C}_N(t)\). The time series forecast is
obtained by reconstructing these results in a linear
manner:

\[ P(t) = a\, C_N(t) + \sum_{j=1}^{N} b_j W_j(t) \qquad (10) \]

The parameters \(a\) and \(b_j\) are obtained by least squares
regression on the training sample set.

\[ \hat{P}(t) = a\, \hat{C}_N(t) + \sum_{j=1}^{N} b_j \hat{W}_j(t) \qquad (11) \]

Although many other reconstruction techniques,
such as nonlinear reconstruction by an
ANN or SVM, are also available, here we use
the linear additive reconstruction for simplicity.

Figure 2. The typical structure of WNN
[Diagram: inputs, hidden layer, outputs]
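
The least squares step for the reconstruction weights can be sketched with the normal equations. This is an illustration under assumptions: the function names, the toy sub-series and the use of normal equations with Gaussian elimination are choices made here, since the chapter does not describe its regression routine.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [A[i][:] + [rhs[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_linear_reconstruction(approx, details, target):
    """Least squares fit of target = a*approx + sum_j b_j*details[j].

    Returns [a, b_1, ..., b_N].
    """
    columns = [approx] + details
    p = len(columns)
    ata = [[sum(u * v for u, v in zip(columns[i], columns[j]))
            for j in range(p)] for i in range(p)]
    aty = [sum(u * t for u, t in zip(columns[i], target)) for i in range(p)]
    return solve(ata, aty)


def reconstruct(coeffs, approx_forecast, detail_forecasts):
    """Combine sub-series forecasts for one time point."""
    return coeffs[0] * approx_forecast + sum(
        b * w for b, w in zip(coeffs[1:], detail_forecasts))


# Toy sub-series (illustrative numbers only); the target is their plain sum.
c = [1.0, 2.0, 3.0, 4.0, 5.0]
w1 = [0.5, -0.5, 0.5, -0.5, 0.5]
w2 = [0.1, 0.3, -0.2, 0.0, -0.1]
p_train = [a + b1 + b2 for a, b1, b2 in zip(c, w1, w2)]
coeffs = fit_linear_reconstruction(c, [w1, w2], p_train)
p_hat = reconstruct(coeffs, 6.0, [0.5, 0.2])
```

When the target is exactly the sum of its sub-series, as in an exact additive decomposition, the fitted weights are all close to 1.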


Model B: Forecasting Crude 
Oil Spot Price by Wavelet 
Neural Network Using OECD 
Petroleum Inventory Levels 


In this section, a wavelet neural network (WNN)
based forecasting model is proposed to predict
the crude oil price; the model can depict the nonlinear
relationships among variables. WNN, which
combines wavelet analysis and feed-forward
neural networks, is effective in overcoming the
poor convergence, or even divergence, encountered
in other kinds of neural networks (Zhang et al.,
1992; Khayamian et al., 2005). The WNN
consists of three layers: an input layer, a hidden layer
and an output layer, as illustrated in Figure 2.


Selection of Model Variables 


A number of factors were considered for the crude oil
price forecasting model. The first one is inventory.
Petroleum inventory levels are a measure
of the balance or imbalance between petroleum
production and demand; they reflect volatile
market pressures on crude oil prices and thus
provide a good market barometer of crude oil
price fluctuation in the short run. The relationship
between petroleum inventory levels and crude oil
price has been studied by many researchers and
has been found to be nonlinear (Dale, 1997; Timothy,
2000; Saif Ghouri, 2006). More specifically, Ye et
al. (2005) built a linear forecasting model, using
relative inventories, to forecast the WTI crude oil
price. At the same time, since inventories have a
lower bound or a minimum operating level, some
of the economic literature, such as Deaton and Laroque
(1992), Miranda and Glauber (1993), Chambers
and Bailey (1996), Michaelides and Ng (2000),
and Routledge et al. (2000), suggests a relationship
between inventory levels and commodity prices.

Figure 3. Crude oil spot price and total OECD inventory

The proposed model focuses on
the nonlinear relation between OECD inventory
and WTI crude oil spot prices. Figure 3 shows
the behavior of the WTI crude oil spot price and
the total OECD inventories from 1992 to 2003.
As shown in Figure 3, the large
swings in the WTI spot price during the late 1990s are
coupled with counter-swings in inventory. During
this period, when the total inventory dropped
from 3991 to 3943 million barrels in March 1999,
WTI spot prices rose from $14.68 to $17.31 per
barrel in the next month. The same situation also
occurred in February 2003. Inventory thus reflects
changes in petroleum production and demand to
some extent. Hence OECD industrial petroleum
inventory levels are included in the model.

We also found that the WTI price is influenced by
its former values, consistent with the findings
of other researchers. So, in order to forecast
the WTI price better, we can include lags of
WTI in the model.

Besides inventory and the lags of WTI, we
should consider some dummy variables in our
model, representing events that significantly impacted
crude oil markets and may have led to structural
change or market disequilibrium.

The variables and their lags to be
used in the nonlinear model can be confirmed by
a multivariable linear model, as shown in the
following equation:

\[ WTI_t = a + \sum_{i=1}^{n} b_i\, INVENTORY_{t-i} + \sum_{j=1}^{m} c_j\, WTI_{t-j} + \sum_{k=1}^{g} d_k\, Dummy_k + \varepsilon_t \qquad (12) \]

From the model estimation results we can identify the
variables that are significant and then use them
in the nonlinear model.


WNN Approach for Oil 
Price Series Modeling 


The detailed steps of the new WNN-based model
are as follows:

First, analyze the variables that could
influence the variable being forecasted, then build
multivariable linear models to adjust and confirm
the variables that can be used in the nonlinear model,
and at the same time determine the number of lag
variables based on the model estimation results.

Then, initialize the network parameters and
determine the parameters of the WNN-based model,
including the WNN step size for learning, the number
of neurons in the single hidden layer, the generalized
delta learning rule, the wavelet transfer function and
the number of iterations. Initialization of the network
parameters is an important issue in WNN modeling
because it affects the convergence of the model. In
this study, the initialization presented by Oussar
et al. (1998) is used. Supposing the ith input ranges
over the domain \([x_{i,\min}, x_{i,\max}]\), the initial values
of the ith translation and scaling parameters are set to
\(b_i = 0.5\,(x_{i,\min} + x_{i,\max})\) and \(a_i = 0.2\,(x_{i,\max} - x_{i,\min})\),
so that the wavelets do not concentrate on local
regions of the input domain.
Selection of the number of nodes in the hidden
layer of the WNN is another important issue. If the
number is too small, the WNN may not reflect the
complex functional relationship between input data
and output value. By contrast, a large number may
result in an overly complex network with a very large
output error caused by overfitting of the training
sample set. Usually, the number of neurons in the
hidden layer is selected according to previous
experience; it was shown that increasing the number
of neurons in the hidden layer did not improve the
performance and generalization of the WNN. In this
study, we select the network with 8 neurons in the
hidden layer after checking various options.

Throughout the training, the step size for learning
is 0.01, and the Morlet wavelet transfer function
is used. The number of iterations is 1000.
All the parameters are listed in Table 1.
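
The range-based initialization can be sketched as follows. The formulas in the source are damaged, so this sketch assumes the reading in which the translation is the midpoint of each input's range and the scaling is 0.2 times the width of that range; the function name and the sample values are also assumptions.

```python
def init_wavelet_params(samples):
    """Range-based initial translation and scaling for each input dimension.

    samples: list of input vectors (one list of floats per observation).
    """
    n_inputs = len(samples[0])
    translations, scalings = [], []
    for i in range(n_inputs):
        column = [row[i] for row in samples]
        lo, hi = min(column), max(column)
        translations.append(0.5 * (lo + hi))  # centre of the input range
        scalings.append(0.2 * (hi - lo))      # fraction of the range width (assumed)
    return translations, scalings


# Illustrative inputs only: (inventory, lagged price) pairs.
samples = [[3900.0, 15.0], [4100.0, 35.0], [4000.0, 25.0]]
b_init, a_init = init_wavelet_params(samples)
```

Centring the wavelets on the data range and giving them a width tied to that range keeps their effective support inside the region where training inputs actually occur.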

Table 1. Parameters setting in WNN model

Third, set the training samples and testing
samples, and obtain the WNN-based training and
testing results using the final variables and parameters.
We train the model with the training samples
to forecast \(WTI_{t+1}\); then the actual value at \(t+1\)
is added, and the model is re-fitted to forecast
\(WTI_{t+2}\). The reported results are averages
over repeated experiments (for example, twenty
runs). In the WNN, the following steps are
carried out:


1. Initialize the dilation parameters \(a_t\), translation
parameters \(b_t\), and node connection weights \(u_{ti}\), \(w_t\)
to random values. All those random values are limited
to the interval (0, 1).

2. Input the data \(x_n(i)\) and the corresponding output
values \(v_n^T\), where the superscript \(T\) represents
the target output state.

3. Propagate the initial signal forward
through the network:

\[ v_n = \sum_{t} w_t\, h\!\left( \frac{\sum_{i} u_{ti}\, x_n(i) - b_t}{a_t} \right) \qquad (13) \]

where \(h\) is taken as a Morlet wavelet

\[ h(t) = \cos(1.75\,t)\, \exp\!\left( -\frac{t^2}{2} \right) \qquad (14) \]

4. Calculate the WNN parameter updates:

\[ \Delta w_t^{new} = -\eta \frac{\partial E}{\partial w_t^{old}} + \alpha\, \Delta w_t^{old} \]

\[ \Delta u_{ti}^{new} = -\eta \frac{\partial E}{\partial u_{ti}^{old}} + \alpha\, \Delta u_{ti}^{old} \]

\[ \Delta a_t^{new} = -\eta \frac{\partial E}{\partial a_t^{old}} + \alpha\, \Delta a_t^{old} \]

\[ \Delta b_t^{new} = -\eta \frac{\partial E}{\partial b_t^{old}} + \alpha\, \Delta b_t^{old} \]

where the error function \(E\) is taken as
\(E = \frac{1}{2} \sum_{n=1}^{N} (v_n^T - v_n)^2\), and \(v_n^T\) and
\(v_n\) are the experimental and calculated values,
respectively. \(N\) stands for the number of data points
in the training set, and \(\eta\) and \(\alpha\) are the learning
rate and the momentum term, respectively.

5. The WNN parameters are changed until the
network output satisfies the error criteria.
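
The forward pass and the momentum update can be sketched as below. This is a minimal illustration, assuming one weighted sum of inputs per hidden node as in step 3; the function names and the numbers in the example are introduced here, and the gradient itself is taken as given rather than derived.

```python
import math


def morlet(t):
    """Morlet wavelet h(t) = cos(1.75 t) * exp(-t^2 / 2)."""
    return math.cos(1.75 * t) * math.exp(-t * t / 2.0)


def wnn_forward(x, u, w, a, b):
    """Network output for one input vector x.

    u[t][i]: input-to-hidden weights, w[t]: hidden-to-output weights,
    a[t], b[t]: dilation and translation of hidden node t.
    """
    total = 0.0
    for t in range(len(w)):
        z = sum(u[t][i] * x[i] for i in range(len(x)))
        total += w[t] * morlet((z - b[t]) / a[t])
    return total


def momentum_step(grad, prev_delta, eta, alpha):
    """New parameter change: -eta * gradient + alpha * previous change."""
    return -eta * grad + alpha * prev_delta


# Illustrative numbers only: one hidden node whose wavelet argument is zero.
out = wnn_forward(x=[1.0, 2.0], u=[[0.5, 0.25]], w=[3.0], a=[2.0], b=[1.0])
delta = momentum_step(grad=2.0, prev_delta=0.5, eta=0.01, alpha=0.9)
```

In the example the weighted input equals the translation, so the wavelet is evaluated at zero, where it equals 1, and the output is simply the output weight.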


EXPERIMENTAL ANALYSIS 


In this section, the experimental results of the 
proposed Model A and Model B are presented. 
First of all, we describe the data source and the 
evaluation criteria used in this study and then 
report the experimental results, respectively. 

In this study, we use monthly nominal West
Texas Intermediate (WTI) crude oil spot prices in
our experiments; WTI plays a significant role in the
international crude oil markets and is usually
considered a world benchmark price. All
the data come from the Energy Information
Administration (EIA).

Two typical criteria are employed to assess
and compare the in-sample and out-of-sample
forecasting ability of the proposed forecasting
models and others in this study: root
mean squared error (RMSE) and the directional
statistic (\(D_{stat}\)). Given \(N\) pairs of
actual values (or targets, \(x_t\)) and predicted values
(\(\hat{x}_t\)), the RMSE can be defined as

\[ RMSE = \sqrt{ \frac{1}{N} \sum_{t=1}^{N} (\hat{x}_t - x_t)^2 } \qquad (15) \]

Clearly, the RMSE is a quadratic scoring rule
which measures the average magnitude of the
error. Since the errors are squared before they are
averaged, the RMSE gives a relatively high weight
to large errors. This means the RMSE is most
useful when large errors are particularly undesirable.
But in oil price forecasting, correct forecasting
of movement directions or turning points between
the actual and predicted values, \(x_t\) and \(\hat{x}_t\), is also
of great importance. The ability to predict movement
direction or turning points can be measured
by a statistic developed by Yao and Tan (2000).
The directional change statistic (\(D_{stat}\)) can be
expressed as

\[ D_{stat} = \frac{1}{N} \sum_{t=1}^{N} a_t \times 100\% \qquad (16) \]

where \(a_t = 1\) if \((x_{t+1} - x_t)(\hat{x}_{t+1} - x_t) \geq 0\), and \(a_t = 0\)
otherwise, and \(N\) is the number of testing
samples.
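
Both criteria are short to compute. The sketch below is illustrative: the function names and the toy values are assumptions, and the directional statistic is averaged over the available consecutive pairs (one fewer than the number of points), which is a choice made here for a self-contained example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def d_stat(actual, predicted):
    """Percentage of steps where the predicted direction matches the actual one.

    A step counts as a hit when (x[t+1] - x[t]) * (xhat[t+1] - x[t]) >= 0.
    """
    n_steps = len(actual) - 1
    hits = 0
    for t in range(n_steps):
        actual_move = actual[t + 1] - actual[t]
        predicted_move = predicted[t + 1] - actual[t]
        if actual_move * predicted_move >= 0:
            hits += 1
    return 100.0 * hits / n_steps


# Toy values (illustrative numbers only).
actual = [20.0, 22.0, 21.0, 25.0]
predicted = [20.0, 21.0, 23.0, 24.0]
error = rmse(actual, predicted)
direction = d_stat(actual, predicted)
```

In the toy data the second step is a miss: the price fell from 22 to 21 while the forecast of 23 implied a rise.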


Experimental Results of Model A 


Data Description and 
Structural Breaks Test 


The available WTI data cover the period from
January 1986 to July 2007. In order to make a good
prediction, structural break testing is performed
with the iterated cumulative sums of squares
(ICSS) algorithm (Inclan and Tiao, 1994).
A structural break is a kind of nonstationarity.
It is often caused by changes in the structure of
the economy or industry, and by events that change the
dynamics of specific industries or firm-related
quantities, such as inventories, sales, and production.
Breaks often lead to misleading
inference or forecasts if they are neglected in
the models used. Structural break testing therefore
helps restrict the training and testing
samples to a period with the same structure. In this study,
the BP multiple structural break point test method,
developed by Bai and Perron (2003) and based on
a sum of squared residuals minimization criterion,
is used to detect the number and timing of breaks
in the oil price series.

Four breaks, at 1987/01, 1990/08,
1991/02 and 1998/09, are found in the WTI oil price
from January 1986 to July 2007. Based on these
breaks, the whole WTI oil price data set can be divided
into five sections: 1986/01-1986/12 (12 months),
1987/01-1990/07 (43 months), 1990/08-1991/01
(6 months), 1991/02-1998/08 (91 months) and
1998/09-2007/07 (107 months).

Because the first three sections are too
short to be suitable for model training, we select
the last two sections as the sample sets in the
experiment.

Sample set 1: 1991/02-1998/08. The training
set is from 1991/02 to 1995/12 with 59 data points,
and the testing set is from 1996/01 to 1998/08
with 32 data points.

Sample set 2: 1998/09-2007/07. The training
set is from 1998/09 to 2004/12 with 76 data points,
and the testing set is from 2005/01 to 2007/07
with 31 data points.
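
The sample sizes quoted above follow from counting months inclusively. A small helper, introduced here only as a check, reproduces them:

```python
def months_inclusive(start, end):
    """Number of monthly observations from start to end, both included.

    start and end are (year, month) tuples.
    """
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0) + 1


set1_train = months_inclusive((1991, 2), (1995, 12))
set1_test = months_inclusive((1996, 1), (1998, 8))
set2_train = months_inclusive((1998, 9), (2004, 12))
set2_test = months_inclusive((2005, 1), (2007, 7))
```

The training and testing counts of each sample set add up to the section lengths of 91 and 107 months.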


Crude Oil Price Series Decomposition 


The Haar à trous transform provides a convincing
and computationally straightforward solution to
troublesome time series boundary effects at the
time point \(t\). The calculation of scaling coefficients
and wavelet coefficients at time \(t\) uses only
information available up to time \(t\). So it is not
necessary to repeat the decomposition process for
each prediction; instead, the whole sample data,
Sample set 1 or Sample set 2, is decomposed only once.
Figure 4 and Figure 5 illustrate the decomposition
results of sample set 1 and sample set 2,
respectively. As can be seen, there are one
approximation and four detail components for
each oil price series.


Training and Testing Results 


To evaluate the performance of W-LSSVM,
a wavelet-based multi-scale ARIMA (W-ARIMA)
model and two single-scale models, ARIMA and
LSSVM, are used for comparison.
Figure 6 and Figure 7 present the forecasting
results for the testing sets of Sample set 1 and
Sample set 2. In order to compare the multi-scale
models and the single-scale models clearly, the four
models are shown in two separate sub-figures
for each test sample.

For example, as can be seen from Figure 6,
the curve of the crude oil price forecast of
W-LSSVM is closer to the curve of the actual time
series than that of LSSVM. Comparing
W-ARIMA and ARIMA gives the same result.
This shows that the multi-scale models outperform
the single-scale models.

The final results are summarized in Table 2
and Table 3. The MAPE and RMSE of the multi-scale
models are smaller than those of the corresponding
single-scale models. In addition, the hit rate
\(D_{stat}\) is clearly higher. This indicates that
multi-scale decomposition improves prediction
accuracy. Comparing the performance of the linear
and nonlinear models, we find that the nonlinear
models generally behave better than the linear
models. So the multi-scale model W-LSSVM is the
best among the four models.

To fully integrate the advantages of several
models, a simple combination forecast is
presented as follows:

\[ \hat{P}_c(t) = \sum_{i=1}^{n} \beta_i \hat{P}_i(t) \qquad (17) \]

where \(\hat{P}_i(t)\) is the forecast of the ith individual
model and \(\beta_i\) is its combination weight.

The performance of the combination forecast
model is also included in Table 3. The study
shows that the combination model is better than any
individual model for crude oil price forecasting.

Figure 4. Decomposition of sample set 1

Figure 6. Comparison for testing set 1


Experimental Results of Model B


Data Preparation and
Statistical Analysis


Because OECD inventory data are only available
monthly, we use monthly WTI crude oil spot prices
and OECD inventories in this study. All the data
come from the Energy Information Administration
(EIA).

In order to compare the performance of the 
proposed model with other similar ones and test 
the stability of the new model, two sample sets 
are selected from WTI data: 

Sample set 1: January 1992 to September 2003;
the training set is January 1992 to September 2002
and the testing set is October 2002 to September 2003.

Sample set 2: January 1992 to September 2006;
the training set is January 1992 to September 2003
and the testing set is October 2003 to August 2006.

Between 1992 and 2003, there were two significant
events that had a great impact
on the crude oil market. One was the “OPEC quota
tightening” at the beginning of April 1999, which
changed the deviation and mean value of the
WTI price as a whole. The other was the
“9/11 terrorist attacks” in the USA in 2001, which
also led to violent market disequilibrium. Both
events are considered in the existing model of Ye
et al., as they are in this chapter.

First, we present a multivariable linear model
to confirm the variables that can be included
in the nonlinear model. The model is shown in Eq.
(18). Note that subscript \(t\) in the model is for the
\(t\)th month; subscript \(i\) is for the \(i\)th month prior to the
\(t\)th month; \(a\), \(b_i\), \(c_i\), \(d\) and \(e_k\) are coefficients to be
estimated; \(k = 0, 1, 2, \ldots, 5\) refers to the six months
from October 2001 to March 2002; \(M_k\) and \(LAPR\)
are variables to account for market disequilibrium
and shifting caused by the two events mentioned
in the previous paragraph.

\[ WTI_t = a + \sum_{i=1}^{n} b_i\, INVENTORY_{t-i} + \sum_{i=1}^{m} c_i\, WTI_{t-i} + d\, LAPR + \sum_{k=0}^{5} e_k M_k + \varepsilon_t \qquad (18) \]

Figure 7. Comparison for testing set 2

Table 2. The RMSE comparison with different forecasting models

Table 3. The \(D_{stat}\) comparison with different forecasting models

From the estimation results for Eq. (18), we
can see that only inventory with a one-order lag,
WTI with a one-order lag, \(LAPR\) and \(M_k\) (the dummy
variables for the “9/11 terrorist attacks”) are significant.
However, according to the data, different
inventory changes lead to significantly different
movements in price in the next month, and even
similar changes in inventory can result in diverse
changes in price. For example, inventory increased
from 3651 to 3704 million barrels in May 1992,
and the WTI price increased by $1.4 per barrel in
June 1992. In November 2004, inventory
increased from 4012 to 4065 million barrels,
also a rise of 53 million barrels, but the WTI price
decreased by $5.32 per barrel in the next month. So
we use the following model to explore a more
consistent relationship between \(WTI_t\) and
\(\Delta INVENTORY_{t-i}\):

\[ WTI_t = a + \sum_{i=1}^{4} b_i\, \Delta INVENTORY_{t-i} + c\, WTI_{t-1} + \varepsilon_t \qquad (19) \]

where \(\Delta INVENTORY_{t-i} = INVENTORY_{t-i} - INVENTORY_{t-i-1}\).
The estimation results for Eq. (19) support
the above findings about
the relationship between changes in inventory and
oil price. According to the results of the second
multivariable linear model, we take the
one-order lag of WTI, the one-order lag
of inventory, \(\Delta INVENTORY_{t-1}\), and the dummy
variables of the two events as the input variables
of our WNN-based nonlinear model.

In the process of modeling, however, we
found that the dummy variable for the “9/11 terrorist
attacks” does not improve forecasting accuracy
and slows down the convergence of the
WNN program. So it is eliminated from the set of
variables. The input variables of our WNN-based
nonlinear model are shown in Table 4.


Training and Testing Results 


We take \(INVENTORY_{t-1}\), \(INVENTORY_{t-2}\), \(WTI_{t-1}\)
and APR99 as input nodes of the WNN model.
The WNN forecast program is run on Matlab 7.0.
APR99 is represented in binary form,
i.e. the values before April 1999 are set to 0, and
the others are 1.

Table 4. Variables of the forecasting model

INVENTORY_{t-1} | One lag of Inventory
INVENTORY_{t-2} | Two lag of Inventory
WTI_{t-1} | One lag of WTI monthly spot price
APR99 | Dummy variable capturing a structural change in April 1999
The performance of the proposed model
is compared with Ye's linear model (Ye et al.,
2005) and nonlinear model (Ye et al., 2006). For
the training sets of Sample set 1 and Sample set
2, Table 5 shows the RMSE comparison results
with different forecasting models, and the
directional change statistic (\(D_{stat}\)) of the proposed
method is given in Table 6. As can be seen from
Table 5, the WNN model performs best among the
three models, with the lowest RMSE on both
Sample set 1 and Sample set 2. The
new model also achieves a high directional change
statistic, which is over 70% for both sample sets,
as shown in Table 6. Note that the two statistics
of the WNN model (RMSE and \(D_{stat}\)) are averages over
20 experimental runs with different
initial parameters.

The evaluations of the testing results for Sample
set 1 are shown in Table 7. For consistent
comparison with Ye's models, the Mean, MAE (mean
absolute error) and St. Dev (standard deviation)
of the prediction errors are included in the comparison.
The testing procedure for sample set 1 begins by
training the model with data from January 1992 to
September 2002 and forecasting the value for
October 2002. Then the actual value for October 2002
is added, and the model is re-fitted to forecast
the price for November 2002. The process is
repeated until the value for September 2003 is
predicted.
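
This re-fitting procedure is an expanding-window, one-step-ahead forecast. The sketch below shows the loop only; the drift model is a deliberately simple stand-in for the WNN, and all names and numbers are assumptions made for illustration.

```python
def fit_drift(history):
    """Stand-in model: the average one-step change in the history."""
    steps = [later - earlier for earlier, later in zip(history, history[1:])]
    return sum(steps) / len(steps)


def predict_drift(drift, history):
    """One-step forecast: last observed value plus the fitted drift."""
    return history[-1] + drift


def rolling_one_step(series, n_train, fit, predict):
    """Re-fit on all data up to each test point, then forecast that point."""
    forecasts = []
    for end in range(n_train, len(series)):
        history = series[:end]      # grows by one actual value each round
        model = fit(history)
        forecasts.append(predict(model, history))
    return forecasts


# Toy series (illustrative numbers only).
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
forecasts = rolling_one_step(prices, 3, fit_drift, predict_drift)
```

Each forecast uses only values observed before the month being predicted, which mirrors the procedure described for sample set 1.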

From the performance comparisons of various
models shown in Table 7, the prediction
performance of the WNN-based nonlinear forecasting
model is better than that of the other models. The RMSE,
MAE, St. Dev and Mean indicators of prediction
errors all decrease for the testing results. In
particular, the RMSE for WNN is almost half that of
the nonlinear RI model. The decreased standard
deviation of prediction error also indicates the stability
of the proposed model. This demonstrates that
the WNN model performs very well for oil price
forecasting even with sharp fluctuations, as in
February 2003.

Table 5. The RMSE of training results for two
sample sets

Table 6. \(D_{stat}\) of training results for two sample sets

Sample set 1 | Sample set 2
12% | 71.8%

The evaluations of the testing results for Sample
set 2 are shown in Table 8, and the \(D_{stat}\) values for
the two samples are shown in Table 9. As can be seen
from Table 8, the RMSE of sample set 2 is higher than
that of sample set 1. The main reason is that the oil price
rose from $34.31 per barrel in January 2004 to $74.41
per barrel in July 2006, that is, to about 217% of its
starting level (an increase of about 117%) within two
and a half years. In this period, there are 11 months
in which the oil price rose by more than 8%. In particular,
the price rose by more than 13% in October 2004,
March 2005, and June 2005. Such violent
fluctuation leads to a relatively higher prediction
error. The \(D_{stat}\) value in Table 9 indicates that our
model can depict the direction of the WTI price
changes very well, especially for Sample set 2.

From the experiments presented in this study,
we can draw the following conclusions: the
relationship between oil price and inventory is
nonlinear, and WNN can reveal this nonlinear
relationship. In terms of the empirical results, we
conclude that a nonlinear model can be used as
an alternative tool for oil price forecasting, to
obtain higher forecasting accuracy and improve
the prediction quality further.

Table 7. Testing results comparisons of Sample set 1
[Recoverable headers: Models, WNN, Linear RI]

Table 8. Testing results of Sample set 2
[Recoverable header: Statistics]

Table 9. \(D_{stat}\) of testing results of the two sample sets
[Recoverable header: Sample Set 2]


CONCLUSION AND 
FUTURE DIRECTIONS 


This chapter proposes two data mining based
models for crude oil price analysis and forecasting
to obtain accurate prediction results and improve
prediction quality further. Specifically, we have
demonstrated that the relationship between oil
price and inventory is nonlinear and that WNN can
reveal this nonlinear relationship. Also, SVM
can capture the nonlinear structure of each
decomposed subseries of the oil price. The experiments
show that nonlinear models based on WNN and
SVM are more effective for forecasting the crude oil
price than other models such as ARIMA.

As can be seen from this study, data mining
based technologies show more promising
performance than the time series model and econometric
methods, and can be used as an alternative
tool for crude oil price analysis and forecasting.
Meanwhile, some other data mining techniques
can also be introduced to this field, for example,
cluster analysis and association rules mining.
The high volatility and irregularity of the crude oil
market create uncertainty, mainly because of the
interaction of many factors in crude oil markets.

Further, how to analyze and use these factors to
forecast the crude oil price has attracted increasing
attention from academics and practitioners.
Cluster analysis can be used to study all kinds of
factors affecting the crude oil price, such as war, climate,
speculation, foreign exchange, and so on. Meanwhile,
association rules mining can be introduced
to extract decision rules about the oil price and these
factors. Such rules can help decision makers
gain a good understanding of the causes of high
volatility and support their important investment
decisions. Other data mining techniques for the
crude oil market are worth exploring further
in the future.


ACKNOWLEDGMENT 


This work is supported by the National Natural Science
Foundation of China (NSFC No. 70801058).


REFERENCES 


Abramson, B., & Finizza, A. (1991). Using
belief networks to forecast oil prices. International
Journal of Forecasting, 7(3), 299-315.
doi:10.1016/0169-2070(91)90004-F


Abramson, B., & Finizza, A. (1995). Probabi- 
listic forecasts from probabilistic models: a case 
study in the oil market. International Journal of 
Forecasting, 11(1), 63-72. doi:10.1016/0169- 
2070(94)02004-9 


Adrangi, B., Chatrath, A., Dhanda, K. K., & Raf- 
fiee, K. (2001). Chaos in oil prices? Evidence from 
futures markets. Energy Economics, 23,405—425. 
doi:10.1016/S0140-9883(00)00079-7 


Al Faris, A. (1991). The determinants of crude oil 
price adjustment in the world petroleum market. 
OPEC Review, 15. 


Bacon, R. (1991). Modelling the price of oil. 
Oxford Review of Economic Policy, 7(2), 17-34. 
doi:10.1093/oxrep/7.2.17 


Barone-Adesi, G., Bourgoin, F., & Giannopoulos, 
K. (1998). Don’t look back. Risk (Concord, NH), 
(August): 100-107. 


Chambers, M. J., & Bailey, R. J. (1996). A 
theory of commodity price fluctuations. The 
Journal of Political Economy, 104(5), 924-957. 
doi:10.1086/262047 


Dale, C., & Zyren, J. (1997). Petroleum futures 
markets: volatile prices, controversial functions, 
stagnant volumes. In Petroleum 1996: Issues and 
Trends, (pp. 92-100). Washington, DC: DOE/ 
EIA-0615. 


Deaton, A., & Larque, G. (1992). On the behavior 
of commodity prices. The Review of Economic 
Studies, 59(198), 1-23. doi:10.2307/2297923 


Fan, Y., Liang, Q., & Wei, Y. M. (2008). A gener- 
alized pattern matching approach for multi-step 
prediction of crude oil price. Energy Economics, 
30, 889-904. doi:10.1016/j.eneco.2006.10.012 


Gulen, S. G. (1998). Efficiency in the crude oil 
futures markets. Journal of Energy Finance & 
Development, 3(1), 13—21. doi:10.1016/S1085- 
7443(99)80065-9 


201 


Data Mining Methods for Crude Oil Market Analysis and Forecast 


Huntington, H. G. (1994). Oil price forecasting in 
the 1980s: what went wrong? The EnergyJournal 
(Cambridge, Mass.), 15(2), 1-22. 


Kaboudan, M. A. (2001). Compumetric forecast- 
ing of crude oil prices. In The Proceedings of 
IEEE Congress on Evolutionary Computation, 
(pp. 283-287). 


Lanza, A., Manera, M., & Giovannini, M. (2005). 
Modeling and forecasting cointegrated rela- 
tionships among heavy oil and product prices. 
Energy Economics, 27, 831—848. doi:10.1016/j. 
eneco.2005.07.001 


Michaelides, A., & Ng, S. (2000). Estimating the 
rational expectations model of speculative stor- 
age: a Monte Carlo comparison of three simula- 


tion estimators. Journal of Econometrics, 96(2), 
23 1-266. doi:10.1016/S0304-4076(99)00058-5 


Miranda, M. J., & Glauber, J. W. (1993). Estima- 
tion of dynamic nonlinear rational expectations 
models of primary commodity markets with pri- 


vate and goysmmsmt sookholding, The Review 
of Economics and Statistics, 75(3), 463-470. 
doi:10.2307/2109460 


Mirmirani, S., & Li, H. C. (2004). A comparison 
of VAR and neural networks with genetic algo- 
rithm in forecasting price of oil. Advances in 
Econometrics, 19, 203—223. doi:10.1016/S073 1- 
9053(04)19008-7 


Morana, C. (2001). A semiparametric approach 
to short-term oil price forecasting. Energy Eco- 
nomics, 23(3), 325-338. doi:10.1016/S0140- 
9883(00)00075-X 


Nelson, Y. S., Stoner, G., & Gemis, H. D. (1994). 
Nix: Results of Delphi VIII survey of oil price 
forecasts. Energy Report, California Energy 
Commission. 


202 


Panas, E., & Ninni, V. (2000). Are oil markets 
chaotic? A non-linear dynamic analysis. Energy 
Economics, 22, 549-568. doi:10.1016/S0140- 
9883(00)00049-9 


Routledge, B., Seppi, D., & Spatt, C. (2000). 
Equilibrium Forward Curves for Commodities. 
The Journal of Finance, 55(3), 1297-1338. 
doi:10.1111/0022-1082.00248 


Sadorsky, P. (2006). Modeling and forecasting 
petroleum futures volatility. Energy Economics, 
28, 467-488. doi:10.1016/j.eneco.2006.04.005 


Saif Ghouri, S. (2006). Assessment of the rela- 
tionship between oil prices and US oil stocks. 
Energy Policy, 34(17), 3327-3333. doi:10.1016/j. 
enpol.2005.07.007 


Shambora, W. E., & Rossiter, R. (2007). Are there 
exploitable inefficiencies in the futures market for 
oil? Energy Economics, 29, 18—27. doi:10.1016/j. 
eneco.2005.09.004 


Timothy, J. C., & Eunnyeong, H. (2000). Price and 
inventory dynamics in petroleum product markets. 
Energy Economics, 22(5),527—548. doi:10.1016/ 
S0140-9883(00)00056-6 


Xie, W., Yu, L., Xu, S. Y., & Wang, S. Y. (2006). 
A new method for crude oil price forecasting 
based on support vector machines, (. LNCS, 
3994, 441-451. 


Ye, M., Zyren, J., & Shore, J. (2002). Forecasting 
crude oil spot price using OECD petroleum inven- 
tory levels. International Advances in Economic 
Research, 8, 324—334. doi:10.1007/BF02295507 


Ye, M., Zyren, J., & Shore, J. (2005). A monthly 
crude oil spot price forecasting model using 
relative inventories. International Journal of 
Forecasting, 21, 491—501. doi:10.1016/j.ijfore- 
cast.2005.01.001 


Data Mining Methods for Crude Oil Market Analysis and Forecast 


Ye, M., Zyren, J., & Shore, J. (2006). Forecasting 
short-run crude oil price using high and low-in- 
ventory variables. Energy Policy, 34, 2736-2743. 
doi:10.1016/j.enpol.2005.03.017 


Yu, L., Lai, K. K., Wang, S. Y., & He, K. J. 
(2007). Oil price forecasting with an EMD-based 
multiscale neural network learning paradigm, ( 
LNCS), 4489, 925—932. 


Yu, L., Wang, S. Y., & Lai, K. K. (2005). A novel 
nonlinear ensemble forecasting model incorporat- 
ing GLAR and ANN for foreign exchange rates. 
Computers & Operations Research, 32(10), 
2523-2541. doi:10.1016/j.cor.2004.06.024 


Yu, L., Wang, S. Y., & Lai, K. K. (2008). Fore- 
casting crude oil price with an EMD-based neural 
network ensemble learning paradigm. Energy 
Economics, 30, 2623-2635. doi:10.1016/j.en- 
eco.2008.05.003 


Zhang, X., Lai, K. K., & Wang, S. Y. (2008). Anew 
approach for crude oil price analysis based on Em- 
pirical Mode Decomposition. Energy Economics, 
30, 905-918. doi:10.1016/j.eneco.2007.02.012 


KEY TERMS AND DEFINITIONS 


Data Mining: Data mining is the process of
extracting hidden patterns from data. Data mining
is becoming an increasingly important tool to
transform data into information.

Crude Oil Price Forecasting: Crude oil price
forecasting is the process of judging the trend of
the crude oil price by analyzing the factors affecting
oil price fluctuation.

Least Squares Support Vector Regression:
Least squares support vector machines (LSSVM)
are reformulations of the standard SVMs which
lead to solving linear KKT systems.

Neural Network: Traditionally, the term
neural network (NN) has been used to refer to
a network or circuit of biological neurons. The
modern usage of the term often refers to artificial
neural networks, which are composed of artificial
neurons or nodes and can be used for classification,
forecasting, regression and so on.

Support Vector Machine: Support vector 
machines (SVM) are a set of related supervised 
learning methods used for classification and 
regression. 

Wavelet Neural Network: A wavelet neural
network (WNN) is a novel approach to function
learning. Wavelet networks, which combine
wavelet theory and feed-forward neural
networks, use wavelets as the basis functions
to construct a network.


ABSTRACT


This chapter presents a new method to analyze the link between the probabilities produced by a classification
model and the variation of its input values. The goal is to increase the predictive probability of
a given class by exploring the possible values of the input variables taken independently. The proposed
method is presented in a general framework, and then detailed for naive Bayesian classifiers. We also
demonstrate the importance of “lever variables”, variables which can conceivably be acted upon to
obtain specific results as represented by class probabilities, and consequently can be the target of specific
policies. The application of the proposed method to several data sets shows that such an approach can
lead to useful indicators.


INTRODUCTION 


Given a database, one common task in data analysis
is to find the relationships or correlations between a
set of input or explanatory variables and one target
variable. This knowledge extraction often goes
through the building of a model which represents
these relationships (Han & Kamber, 2006). Faced
with a classification problem, a probabilistic model
allows, for all the instances of the database and
given the values of the explanatory variables, the
estimation of the probabilities of occurrence of
each target class.

These probabilities, or scores, can be used to
evaluate existing policies and practices in organizations
and governments. They are not always
directly usable, however, as they do not give any
indication of what action can be decided upon to
change this evaluation. Consequently, it seems
useful to propose a methodology which would,
for every instance in the database, (i) identify the
importance of the explanatory variables; (ii) identify
the position of the values of these explanatory
variables; and (iii) propose an action in order to
change the probability of the desired class. We
propose to deal with the third point by exploring
the model relationship between each explanatory
variable, taken independently of the others, and the
target variable. The method presented
in this chapter is completely automatic.

This chapter is organized as follows: the first
section positions the approach in relation to the
state of the art; the second section describes the
method, first from a generic point of view and
then for the naive Bayes classifier. Through three
illustrative examples, the third section allows a
discussion and a progressive interpretation of
the obtained results. In each illustrative example
different practical details of the proposed method
are explored. Finally we shall conclude.


BACKGROUND 


Machine learning abounds with methods for supervised
analysis in regression and/or classification.
Generally these methods propose algorithms to 
build a model from a training database made up 
of a finite number of examples. The output vector 
gives the predicted probability of the occurrence 
of each class label. In general, however, this 
probability of occurrence is not sufficient and an 
interpretation and analysis of the result in terms 
of correlations or relationships between input and 
output variables is needed. 

Furthermore, the interpretation of the model is
often based on the parameters and the structure of
the model. One can cite, for example: geometrical
interpretations (Brennan & Seiford, 1987),
interpretations based on rules (Thrun, 1995) or
fuzzy rules (Benitez, Castro, & Requena, 1997),
and statistical tests on the coefficients of the model
(Nakache & Confais, 2003). Such interpretations are
often based on averages for several instances, for
a given model, or for a given task (regression or
classification).


Another approach, called sensitivity analysis,
consists in analyzing the model as a black box
by varying its input variables. In such “what if”
simulations, the structure and the parameters of
the model are important only as far as they allow
accurate computations of dependent variables using
explanatory variables. Such an approach works
whatever the model. Large surveys of “what if”
methods, often used for artificial neural networks,
are available in (Leray & Gallinari, 1998; Lemaire,
Féraud, & Voisine, 2006).


VARIABLE IMPORTANCE 


Whatever the method and the model, the goal is
often to analyze the behavior of the model in the
absence of one input variable, or a set of input
variables, and to deduce the importance of the input
variables, for all examples. The reader can find
a large survey in (Guyon, 2005). The measure of
the importance of the input variables allows the
selection of a subset of relevant variables for a
given problem. This selection increases the robustness
of models and simplifies the understanding
of the results delivered by the model. The variety
of supervised learning methods, coming from the
statistical or artificial intelligence communities,
often implies importance indicators specific to
each model (linear regression, artificial neural
network...).

Another possibility is to try to study the
importance of a variable for a given example
and not on average for all the examples. Given a
variable and an example, the purpose is to obtain
the variable importance only for this example:
for additive classifiers see (Poulin et al., 2006),
for Probabilistic RBF Classification Networks see
(Robnik-Sikonja, Likas, Constantinopoulos, &
Kononenko, 2009), and for a general methodology
see (Lemaire & Féraud, 2008). If the model
is restricted to a naive Bayes classifier, a state
of the art is presented in (Možina, Demšar, Kattan,
& Zupan, 2004; Robnik-Sikonja & Kononenko,
2008). This importance gives a specific piece of
information linked to one example instead of an
aggregate piece of information for all examples.


IMPORTANCE OF THE VALUE 
OF AN INPUT VARIABLE 


To complement the importance of a variable, the
analysis of the value of the considered variable,
for a given example, is interesting. For example,
Féraud et al. (2002) propose to cluster examples
and then to characterize each cluster using the
importance of the variables and the importance of
the values inside every cluster. Framling (1996) uses a “what
if” simulation to place the value of the variable
and the associated output of the model among
all the potential values of the model outputs.
This method, which uses extremums and an assumption
of monotonic variations of the model output
versus the variations of the input variable,
has been improved in (Lemaire & Féraud, 2008).


INSTANCE CORRELATION 
BETWEEN AN EXPLANATORY 
VARIABLE AND THE TARGET CLASS 


This chapter proposes to complement the two aspects
presented above, namely the importance
of a variable and the importance of the value of
a variable. We propose to study the correlation,
for one instance and one variable, between the
input and the output of the model.

For a given instance, the distinct values of a
given input variable can pull up (higher value)
or pull down (lower value) the model output.
The proposed idea is to analyze the relationship
between the values of an input variable and the
probability of occurrence of a given target class.
The goal is to increase (or decrease) the model
output, the target class probability, by exploring
the different values taken by the input variable. For
instance, for medical data one tries to decrease the
probability of a disease; in the case of cross-selling
one tries to increase the appetency for a product;
and in government data cases one tries to define a
policy to reach specific goals in terms of specific
indicators (for example, decrease the unemployment
rate).

This method does not explore causalities,
only correlations, and can be viewed as a method
between:


• selective sampling (Roy & McCallum,
2001) or adaptive sampling (Singh, Nowak,
& Ramanathan, 2006): the model observes
a restricted part of the universe materialized
by examples but can “ask” to explore
the variation space of the descriptors
one by one separately, to find interesting
zones;

• and causality exploration (Kramer,
Leventhal, Hutchinson, & Feinstein, 1979;
Guyon, Constantin Aliferis, & Elisseeff,
2007): as an example, D. Choudat (Choudat,
2003) proposes the imputability approach to
specify the probability of the professional
origin of a disease. The causality probability
is, for an individual, the probability
that his disease arose from exposures to
professional elements. The increase of the
risk has to be computed versus the respective
role of each possible type of exposure.
In medical applications, the models used
are often additive models or multiplicative
models.


LEVER VARIABLES 


In this chapter we also advocate the definition of
a subset of the explanatory variables, the “lever
variables”. These lever variables are defined as the
explanatory variables whose values can conceivably
be changed. In most cases, changing the
values of some explanatory variables (such as sex,
age...) is indeed impossible. The exploration of
instance correlation between the target class and
the explanatory variables can be limited in practice
to variables which can effectively be changed.

The definition of these lever variables will allow
a faster exploration by reducing the number
of variables to explore, and will give more intelligible
and relevant results. Lever variables are the
natural target for policies and actions designed to
induce changes of occurrence of the desired class
in the real world.


CORRELATION EXPLORATION 
- METHOD DESCRIPTION 


In this section, the proposed method is first
described in the general case, for any type of
predictive model, and then tested on naive Bayes
classifiers.


General Case 


Let $C_z$ be the target class among $T$ target classes.
Let $f_z$ be the function which models the predicted
probability of the target class, $f_z(X = x) = P(C_z \mid X = x)$,
given the equality of the vector $X$ of the
$J$ explanatory variables to a given vector $x$ of $J$
values. Let $v_{jn}$ be the $n$-th of the $N_j$ different
possible values of the variable $X_j$.

Algorithm 1 (Figure 1) describes the proposed
method. This algorithm tries to increase the value
of $P(C_z \mid X = x_k)$ successively for each of the $K$
examples of the considered sample set using the
set of values of all the explanatory variables or
lever variables. This method is halfway between
selective sampling (Roy & McCallum, 2001)
and adaptive sampling (Singh et al., 2006). The
model observes a restricted part of the universe
materialized by examples but can “ask” to explore
the variation space of the descriptors one by one
separately, to find interesting zones. The next
subsections describe the algorithm in more detail.


Exploration of Input Values 


For the instance $x_k$, $P(C_z \mid x_k)$ is the “natural” value
of the model output. We propose to modify the
values of the explanatory variables or lever variables
in order to study the variation of the model
output for this example. In practice, we propose
to explore the values independently for each explanatory
variable. Let $P_j(C_z \mid x_k, b)$ be the output of the
model $f_z$ given the example $x_k$ but for which the
value of its $j$-th component has been replaced with
the value $b$. For example, when the third explanatory
variable is modified among five variables:
$P_3(C_z \mid x_k, b) = f_z(x_k^1, x_k^2, b, x_k^4, x_k^5)$.
By scanning all the variables and, for each of them,
the whole set of their possible values, an exploration
of “potential” values of the model output is computed for the
example $x_k$.
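
The substitution of a single component can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this chapter: `what_if` and `toy_model` are hypothetical names, and the model is assumed to be any callable that returns the probability of the target class for a vector of values.

```python
from typing import Callable, Sequence


def what_if(model: Callable[[Sequence], float], x_k: Sequence, j: int, b) -> float:
    """Return the model output for x_k with its j-th component replaced by b."""
    modified = list(x_k)      # copy, so that the original example is untouched
    modified[j] = b
    return model(modified)


def toy_model(x: Sequence) -> float:
    """Hypothetical score in [0, 1]: grows with x[2], shrinks with x[0]."""
    return max(0.0, min(1.0, 0.5 + 0.1 * x[2] - 0.05 * x[0]))


x_k = [2, 7, 1, 0, 3]                      # an example with five variables
natural = toy_model(x_k)                   # "natural" output for x_k
explored = what_if(toy_model, x_k, 2, 4)   # third variable replaced by b = 4
```

Because the model is only called, never inspected, the same helper applies to any classifier that exposes a class probability.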


Domain of Exploration 
of Each Variable 


The advantage of choosing the empirical probability
distribution of the data as the domain of
exploration has been shown experimentally in
(Breiman, 2001; Lemaire et al., 2006; Lemaire &
Féraud, 2008). A theoretical proof is also available
for linear regression in (Diagne, 2006) and
for naive Bayes classifiers in (Robnik-Sikonja
& Kononenko, 2008). Consequently the values
used for the $J$ explanatory variables will be the
values of the $K$ examples available in the training
database. This set can also be reduced by using only
the distinct values: let $N_j$ be the number of distinct
values of the variable $X_j$.
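
As a hedged sketch of this choice, the domain of exploration can be collected directly from the training examples. The function name `exploration_domain` and the four toy rows are illustrative assumptions; the rows only mimic the kind of categorical data discussed later in the chapter.

```python
def exploration_domain(training_set):
    """Return, for each variable j, the sorted distinct values seen in training."""
    n_vars = len(training_set[0])
    return [sorted({row[j] for row in training_set}) for j in range(n_vars)]


# Hypothetical training examples with three categorical variables.
training_set = [
    ("3rd", "adult", "male"),
    ("1st", "adult", "female"),
    ("crew", "adult", "male"),
    ("3rd", "child", "female"),
]
domain = exploration_domain(training_set)
n_distinct = [len(values) for values in domain]   # number of distinct values per variable
```

Restricting the scan to distinct values keeps the number of model evaluations per example equal to the sum of the entries of `n_distinct`.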


Results Ranking 


The exploration of the explanatory variables or
of the lever variables is done by scanning all
the possible values taken by the examples in the
training set. When the modification of the value
of the variable leads to an improvement of the
probability predicted by the model, three pieces
of data are kept: (i) the value which leads to this
improvement (Ca); (ii) the associated improved
probability (PCa); and (iii) the variable associated
with this improvement (XCa). These triplets
are then sorted according to the improvement
obtained on the predicted probability. Note: if
no improvement is found, the tables Ca and PCa
only contain null values.

Figure 1. Algorithm 1: Exploration and ranking of the score improvements

For the example (the customer) x_k do
    l = 0
    For all the explanatory variables X_j from j = 1 to j = J do
        For all the N_j different values v_jn of the variable X_j from n = 1 to n = N_j do
            If P_j(C_z | x_k, b = v_jn) > P(C_z | x_k) then
                Ca[l] = v_jn
                PCa[l] = P_j(C_z | x_k, b = v_jn)
                XCa[l] = j
            else
                Ca[l] = 0
                PCa[l] = 0.0
                XCa[l] = 0
            end If
            l = l + 1
        end For
    end For
    Decreasing sort of Ca, PCa and XCa, using the values of PCa
end For

It should also be possible (i) to explore jointly
two or more explanatory variables; or (ii) to use
the value (Ca[0]) which best improves the output
of the model ($P(C_z \mid X = x_k)$) (this value Ca[0] is
available at the end of the algorithm) and then
to repeat the exploration on the example $x_k$
for its other explanatory variables. These other
versions are not presented in this chapter but will
be the focus of future work.
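
The exploration and ranking for one example can be sketched in a few lines. This is a simplified illustration rather than the chapter's implementation: unlike Algorithm 1 it stores only the improving triplets instead of null entries, and the names `explore_and_rank`, `toy_model` and the optional `lever_vars` argument are assumptions introduced here.

```python
def explore_and_rank(model, x_k, domain, lever_vars=None):
    """Return the baseline output and the improving (probability, variable, value)
    triplets for x_k, sorted by decreasing probability."""
    baseline = model(x_k)
    variables = range(len(x_k)) if lever_vars is None else lever_vars
    improvements = []
    for j in variables:
        for b in domain[j]:
            if b == x_k[j]:
                continue                     # same value: nothing to explore
            modified = list(x_k)
            modified[j] = b
            p = model(modified)
            if p > baseline:
                improvements.append((p, j, b))
    improvements.sort(key=lambda triplet: triplet[0], reverse=True)
    return baseline, improvements


def toy_model(x):
    """Hypothetical score for (status, age, sex); not a trained classifier."""
    score = 0.2
    score += {"1st": 0.2, "2nd": 0.1, "3rd": 0.0, "crew": 0.0}[x[0]]
    score += {"adult": 0.0, "child": 0.1}[x[1]]
    score += {"male": 0.0, "female": 0.4}[x[2]]
    return score


domain = [["1st", "2nd", "3rd", "crew"], ["adult", "child"], ["female", "male"]]
x_k = ("3rd", "adult", "male")
baseline, ranked = explore_and_rank(toy_model, x_k, domain)
# Restricting the scan to variable 0 mimics the use of lever variables.
levers_only = explore_and_rank(toy_model, x_k, domain, lever_vars=[0])[1]
```

Passing `lever_vars` shortens the scan and keeps only values that could actually be acted upon.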


Cases with Class Changes 


When using Algorithm 1 (Figure 1), the predicted
class can change. Indeed it is customary to use the
following formulation to designate the predicted
class of the example $x_k$:

$$\arg\max_{z} P(C_z \mid x_k)$$

Using Algorithm 1 for $x_k$ belonging to the class
$t$ ($t \neq z$) could produce $P_j(C_z \mid x_k, b) > P(C_t \mid x_k)$. In
this case the corresponding value (Ca) carries
important information which can be exploited.

The use of Algorithm 1 can exhibit three types
of values (Ca):

• values which do not increase the target
class probability;

• values which increase the target class probability
but without class change (the probability
increase is not sufficient);

• values which increase the target class probability
with class change (the probability
increase is sufficient).


The examples whose predicted class changes 
from another class to the target class are the 
primary target for specific actions or policies 
designed to increase the occurrence of this class 
in the real world. 
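
The three types of values can be told apart with a small helper. This is a hedged sketch: `categorize_value`, `predicted_class` and `toy_predict_proba` are hypothetical names, and the model is assumed to return one probability per class, the predicted class being the arg max.

```python
def predicted_class(probabilities):
    """Index of the most probable class (arg max)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def categorize_value(predict_proba, x_k, j, b, z):
    """Classify the effect of replacing component j of x_k by b on class z."""
    before = predict_proba(x_k)
    modified = list(x_k)
    modified[j] = b
    after = predict_proba(modified)
    if after[z] <= before[z]:
        return "no increase"
    if predicted_class(before) != z and predicted_class(after) == z:
        return "increase with class change"
    # Also covers examples that were already predicted as class z.
    return "increase without class change"


def toy_predict_proba(x):
    """Hypothetical two-class model over (status, age, sex): [P(no), P(yes)]."""
    p_yes = 0.2
    p_yes += {"1st": 0.2, "2nd": 0.1, "3rd": 0.0, "crew": 0.0}[x[0]]
    p_yes += {"adult": 0.0, "child": 0.1}[x[1]]
    p_yes += {"male": 0.0, "female": 0.4}[x[2]]
    return [1.0 - p_yes, p_yes]


x_k = ("3rd", "adult", "male")
labels = {
    (j, b): categorize_value(toy_predict_proba, x_k, j, b, z=1)
    for j, b in [(0, "crew"), (0, "1st"), (2, "female")]
}
```

Examples falling in the last category for a lever variable are the ones a policy would target first.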


Case of a Naive Bayesian Classifier 


A naive Bayes classifier assumes that all the explanatory
variables are independent knowing the
target class. This assumption drastically reduces
the necessary computations. Using Bayes'
theorem, the expression of the obtained estimator
for the conditional probability of a class $C_z$ is:

$$P(C_z \mid x_k) = \frac{P(C_z) \prod_{j=1}^{J} P(X_j = v_{jk} \mid C_z)}{\sum_{t=1}^{T} \left[ P(C_t) \prod_{j=1}^{J} P(X_j = v_{jk} \mid C_t) \right]} \qquad (1)$$


The predicted class is the one which maxi- 
mizes the conditional probabilities. Despite the 
independence assumption, this kind of classifier 
generally shows satisfactory results (Hand & 
Yu, 2001). Moreover, its formulation allows an 
exploration of the values of the variables one by 
one independently. 

The probabilities $P(X_j = v_{jk} \mid C_z)$ ($\forall j, k, z$) are
estimated using counts after discretization for
numerical variables or grouping for categorical
variables (Boullé, 2008). The denominator of
the equation above normalizes the result so that
$\sum_{z} P(C_z \mid x_k) = 1$.
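
A count-based estimator of this kind can be sketched as below. It is a deliberately naive illustration: plain relative frequencies on already-categorical toy data stand in for the discretization and grouping method cited above, no smoothing is applied (an unseen value would yield a zero probability), and the names `fit_naive_bayes` and `posterior` are assumptions.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Estimate class priors and per-variable conditionals by relative frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)       # (j, class) -> Counter of values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}

    def conditional(j, value, c):
        return value_counts[(j, c)][value] / class_counts[c]

    return priors, conditional


def posterior(priors, conditional, x):
    """Normalized product of the prior and the per-variable conditionals."""
    numerators = {}
    for c, prior in priors.items():
        product = prior
        for j, value in enumerate(x):
            product *= conditional(j, value, c)
        numerators[c] = product
    total = sum(numerators.values())
    return {c: n / total for c, n in numerators.items()}


# Hypothetical data: (sex, status) -> survived.
rows = [("male", "3rd"), ("male", "3rd"), ("male", "1st"), ("male", "1st"),
        ("female", "3rd"), ("female", "1st"), ("female", "1st"), ("female", "3rd")]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]
priors, conditional = fit_naive_bayes(rows, labels)
p = posterior(priors, conditional, ("male", "3rd"))
```

On this toy data the posterior of "yes" for a male third-class example is 0.1, and the two class probabilities sum to one as the normalization requires.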

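The use of Algorithm 1 requires computing
$P(C_z \mid X = x_k)$ and $P_q(C_z \mid X = x_k, b)$, where $q$ is the
index of the modified variable. They can be
written in the form of Equations 2 and 3:

$$P(C_z \mid X = x_k) = \frac{P(C_z) \prod_{j=1}^{J} P(X_j = v_{jk} \mid C_z)}{\sum_{t=1}^{T} \left[ P(C_t) \prod_{j=1}^{J} P(X_j = v_{jk} \mid C_t) \right]} \qquad (2)$$

$$P_q(C_z \mid x_k, b) = \frac{P(C_z) \left[ \prod_{j=1, j \neq q}^{J} P(X_j = v_{jk} \mid C_z) \right] P(X_q = b \mid C_z)}{\sum_{t=1}^{T} \left[ P(C_t) \left[ \prod_{j=1, j \neq q}^{J} P(X_j = v_{jk} \mid C_t) \right] P(X_q = b \mid C_t) \right]} \qquad (3)$$

In Equations 2 and 3 the numerators can be written
as $e^{L_z}$ and $e^{L_{zq}}$ with:

$$L_z = \log(P(C_z)) + \sum_{j=1}^{J} \log(P(X_j = v_{jk} \mid C_z))$$

and

$$L_{zq} = \log(P(C_z)) + \sum_{j=1, j \neq q}^{J} \log(P(X_j = v_{jk} \mid C_z)) + \log(P(X_q = b \mid C_z))$$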


This formulation will be used below. 


Implementation Details on 
Very Large Databases 


To measure the reliability of our approach, we
tested it on marketing campaigns of France Telecom
(results not allowed for publication until now).
Tests have been performed using the PAC platform
(Féraud, Boullé, Clérot, & Fessant, 2008) on different
databases coming from decision-making
applications. The databases used for testing had
more than 1 million customers, each one represented
by a vector including several thousands
of explanatory variables. These tests raise several
implementation points enumerated below:

• To avoid numerical problems when comparing
the “true” model output $P(C_z \mid x_k)$
and the “explored” output $P_q(C_z \mid x_k, b)$,
$P(C_z \mid x_k)$ is computed as:

$$P(C_z \mid x_k) = \frac{1}{\sum_{t=1}^{T} e^{L_t - L_z}}$$

where

$$L_z = \log(P(C_z)) + \sum_{j=1}^{J} \log(P(X_j = v_{jk} \mid C_z))$$

• To reduce the computation time: the modified
output of the classifier can be computed
using only a few additions or subtractions
since the difference between $L_z$ (used
in Equation 2) and $L_{zq}$ (used in Equation 3)
is:

$$L_{zq} = L_z - \log(P(X_q = v_{qk} \mid C_z)) + \log(P(X_q = b \mid C_z))$$

• Complexity: for a given example $x_k$, the
computation of the tables presented in
Algorithm 1 is of complexity $O\left(\sum_{j=1}^{J} N_j\right)$.

This implementation is “real-time” and can be
used by an operator who asks the application which
actions to take, for example to keep a customer.
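
The first two implementation points can be illustrated with a short sketch. It assumes the log-probabilities are already available in nested lists and dictionaries (`log_priors[t]`, `log_cond[t][j][value]`); this layout, the function names and the overflow guard are assumptions made for the illustration, not details taken from the platform mentioned above.

```python
import math


def log_scores(log_priors, log_cond, x):
    """For every class t: log prior plus the sum of the log conditionals of x."""
    return [lp + sum(log_cond[t][j][v] for j, v in enumerate(x))
            for t, lp in enumerate(log_priors)]


def posterior_from_logs(L, z):
    """Posterior of class z computed as 1 / sum_t exp(L_t - L_z)."""
    try:
        return 1.0 / sum(math.exp(L_t - L[z]) for L_t in L)
    except OverflowError:
        return 0.0        # class z is overwhelmingly unlikely


def updated_log_scores(L, log_cond, x, q, b):
    """Replace the contribution of variable q by that of value b, for every class."""
    return [L_t - log_cond[t][q][x[q]] + log_cond[t][q][b]
            for t, L_t in enumerate(L)]


# Hypothetical two-class model (0 = "no", 1 = "yes") over (sex, status).
log_priors = [math.log(0.5), math.log(0.5)]
log_cond = [
    [{"male": math.log(0.75), "female": math.log(0.25)},
     {"3rd": math.log(0.75), "1st": math.log(0.25)}],
    [{"male": math.log(0.25), "female": math.log(0.75)},
     {"3rd": math.log(0.25), "1st": math.log(0.75)}],
]
x = ("male", "3rd")
L = log_scores(log_priors, log_cond, x)
p_true = posterior_from_logs(L, 1)                      # "true" output
L_q = updated_log_scores(L, log_cond, x, 0, "female")   # explore sex = female
p_explored = posterior_from_logs(L_q, 1)                # "explored" output
```

Each explored value costs one subtraction and one addition per class, which is what makes the scan affordable with thousands of variables.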


EXPERIMENTATIONS 


In this section we describe the application of our
proposed method to three illustrative examples.
The first example, the Titanic database, illustrates
the importance of lever variables. The second
example illustrates the results of our method on
the dataset used for the PAKDD 2007 challenge.
Finally, we present the results obtained by our
method on a government data problem, the analysis
of the type of contraceptive used by married
women in Indonesia.


The Titanic Database - Data
and Experimental Conditions


In this first experiment the Titanic database
(www.ics.uci.edu/~mlearn/) is used. This database
consists of four attributes on 2201
instances (passengers and crew members). The
first attribute (status) represents the travel class of
the passenger or whether the person was a crew member, with
values: 1st, 2nd, 3rd, crew. The second (age) gives
an age indication: adult, child. The third (sex)
indicates the sex of the passenger or crew member: female
or male. The last attribute (survived) is the target
class attribute with values: no or yes. Readers can
find for each instance the variable importance and
the value importance for a naive Bayes classifier
in (Robnik-Sikonja & Kononenko, 2008).

Among the 2201 examples in this database, a
training set of 1100 randomly chosen examples
has been extracted to train a naive Bayes classifier
using the method presented in (Boullé, 2008). The
remaining examples constitute a test set. As the
interpretation of a model with low performance
would not be consistent, a prerequisite is to check
if this naive Bayes classifier is correct. The model
used here (Guyon, Saffari, Dror, & Bumann, 2007)
gives satisfactory results:


• Accuracy on Classification (ACC) on the
train set: 77.0%; on the test set: 75.0%;

• Area under the ROC curve (AUC)
(Fawcett, 2003) on the train set: 73.0%; on
the test set: 72.0%.
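
For readers unfamiliar with these two indicators, the sketch below shows one common way to compute them for a two-class problem. It is an illustration on made-up labels and scores, not the evaluation code behind the figures above; the AUC is computed here as the fraction of (positive, negative) pairs ranked correctly, ties counting for one half.

```python
def accuracy(labels, predictions):
    """Fraction of examples whose predicted class equals the true class."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true classes (1 = survived) and model scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.3, 0.2, 0.6]
predictions = [1 if s > 0.5 else 0 for s in scores]
acc = accuracy(labels, predictions)
area = auc(labels, scores)
```

The pairwise form is quadratic in the number of examples, which is acceptable for a sketch but not for large test sets.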


The purpose here is to see another side of
the knowledge produced by the classifier: we want
to find the characteristics of the instances (people)
which would have allowed them to survive.


Input Values Exploration 
Algorithm 1 has been applied on the test set to
reinforce the probability to survive. Table 1 shows
a summary of the results: (i) it is not possible to
increase the probability for only one passenger or
crew member; (ii) the last column indicates that, for persons
predicted as surviving by the model (343 people),
the first explanatory variable (status) is the most
important to reinforce the probability to survive
for 118 cases; then the second explanatory variable
(age) for 125 cases; and lastly the third one (sex)
for 100 cases; (iii) for people predicted as dead
by the model (758), the third explanatory variable
(sex) is always the variable which is the most
important to reinforce the probability to survive.

Table 1. Ranking of explanatory variables

                  Status / Age / Sex
Predicted 'yes'   118 / 125 / 100
Predicted 'no'    0 / 0 / 758

These 758 cases predicted as dead are men,
and if they were women their probability to survive
would increase sufficiently for them to survive (in the sense
that their probability to survive would be greater
than their probability to die). Let us examine then,
for these cases, additional results obtained by
exploring the other variables using Algorithm 1:


• the second best variable to reinforce the
probability to survive is (and in this case
they survive):
    - for 82 of them (adult + men + 2nd class) the
      second explanatory variable (age);
    - for 676 of them (adult + men + (crew or
      3rd class)) the first explanatory variable
      (status);

• the third best variable to reinforce the
probability to survive is (and in this case
nevertheless they are dead):
    - for 82 of them (adult + men + 2nd class) the
      first explanatory variable (status);
    - for 676 of them (adult + men + (crew or
      3rd class)) the second explanatory variable
      (age).


Of course, in this case, most explanatory variables
are not in fact lever variables, as they cannot
be changed (age or sex). The only variable that
can be changed is status, and even in this case,
only for passengers, not for crew members. The
change of status for passengers means in fact
buying a first class ticket, which would have allowed
them a better chance to survive. The other
explanatory variables enable us to interpret the
obtained survival probability in terms of priority
given to women and first class passengers during
the evacuation.


APPLICATION TO SALE: RESULTS 
ON THE PAKDD 2007 CHALLENGE 


Data and Experimental Conditions 


The PAKDD 2007 challenge data is used
(http://lamda.nju.edu.cn/conf/pakdd07/dmc07/). The
data are not online any more but data descriptions
and analysis results are still available. Thanks to
Mingjun Wei (participant referenced P049) for
the data (version 3).

The company, which gave the database,
currently has a customer base of credit card customers
as well as a customer base of home loan (mortgage)
customers. Both of these products have been on
the market for many years, although for some
reason the overlap between these two customer
bases is currently very small. The company would
like to make use of this opportunity to cross-sell
home loans to its credit card customers, but the
small size of the overlap presents a challenge when
trying to develop an effective scoring model to
predict potential cross-sell take-ups.

A modeling dataset of 40,700 customers with
40 explanatory variables, plus a target variable,
was provided to the participants (the list of
the 40 explanatory variables is available at
http://perso.rd.francetelecom.fr/lemaire/data_pakdd.zip).
This is a sample of customers who opened
a new credit card with the company within a
specific 2-year period and who did not have an
existing home loan with the company. The target
categorical variable “Target_Flag” has a value of
1 if the customer then opened a home loan with
the company within 12 months after opening the
credit card (700 random samples), and has a value
of 0 otherwise (40,000 random samples).

Table 2. PAKDD 2007 challenge: the first three
best results

Participant id   AUC for test set   Rank   Modeling technique
P049             70.01%             1      TreeNet + Logistic Regression
P085             69.99%             2      Probit Regression
P212             69.62%             3      MLP + n-Tuple Classifier

A prediction dataset (8,000 sampled cases) has 
also been provided to the participants with similar 
variables but withholding the target variable. The 
data mining task is to produce a score for each 
customer in the prediction dataset, indicating a 
credit card customer’s propensity to take up a 
home loan with the company (the higher the score, 
the higher the propensity). 

Since the challenge has ended, it was not possible to evaluate our classifier on the prediction dataset (the submission site is closed). We therefore decided to build a model using the 40,000 samples in a 5-fold cross-validation process. In this case each ‘test’ fold contains approximately the same number of samples as the initial prediction dataset. The model used is again a naive Bayes classifier (Boullé, 2008; Guyon, Saffari, et al., 2007). The results obtained are:


• Accuracy on Classification (ACC): 98.29% ± 0.01% on the train sets and 98.20% ± 0.06% on the test sets.

• Area under the ROC curve (AUC): 67.98% ± 0.74% on the train sets and 67.79% ± 2.18% on the test sets.

• Best results obtained on one of the folds: Train set AUC=68.82%, Test set AUC=70.11%.


Table 2 shows the three best results and the corresponding methods of the winners of the challenge. The results obtained here by our model are consistent with those of the challenge participants.
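To make the evaluation protocol above concrete, the following minimal Python sketch computes the AUC with the rank-sum formula and reports its mean and standard deviation over the test folds of a k-fold cross-validation. It is an illustration under stated assumptions, not the implementation used for the reported figures: the function names, the `fit` callback (assumed to train any scoring model and return a scoring function) and the optional shuffling are all hypothetical choices made for the example.

```python
import random
import statistics


def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) statistic; labels are 0 or 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Tied scores share the average of the ranks they span.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def k_fold_indices(n, k, seed=None):
    """Split 0..n-1 into k folds; shuffle first when a seed is given."""
    idx = list(range(n))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_validated_auc(rows, labels, fit, k=5, seed=None):
    """Mean and standard deviation of the test AUC over k folds.

    `fit(train_rows, train_labels)` is a hypothetical callback that must
    return a function mapping one row to a score.
    """
    folds = k_fold_indices(len(rows), k, seed)
    aucs = []
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        score = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        aucs.append(auc([score(rows[i]) for i in test_idx],
                        [labels[i] for i in test_idx]))
    return statistics.mean(aucs), statistics.stdev(aucs)
```

Each test fold must contain both classes, otherwise the AUC is undefined; with a target class as rare as the one described here, a stratified split would be a sensible refinement.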


Input Values Exploration 


The best classifier obtained on the test sets in the previous section is used. This naive Bayes classifier (Boullé, 2007) uses 8 variables out of 40 (the naive Bayes classifier takes into account only input variables which have been discretized (or grouped) in more than one interval (or group); see (Boullé, 2006)). These 8 variables and their intervals of discretization (or groups) are presented in Table 3. All variables are numerical except for the variable “RENT_BUY_CODE”, which is symbolic with possible values of ‘O’ (Owner), ‘P’ (Parents), ‘M’ (Mortgage), ‘R’ (Rent), ‘B’ (Board), ‘X’ (Other).

The lever variables were carefully chosen by using their specification (see http://lamda.nju.edu.cn/conf/pakdd07/dmc07/ or Appendix A). These lever variables are those whose value can be changed by a commercial offer to a customer. We define another type of variable which we will explore using our algorithm: the observable variables. These variables are likely to change during the life of a customer, and this change may increase the probability of the target class, the propensity to take up a home loan.

In this case, the customers for whom this variable has changed can be the target of a specific campaign. For example, the variable “RENT_BUY_CODE” cannot be changed by any offer but is still observable. The customer can move from the group of values [O,P] (‘O’ Owner, ‘P’ Parents) to [M,R,B,X] (‘M’ Mortgage, ‘R’ Rent, ‘B’ Board, ‘X’ Other). Among the eight variables (see Table 3) chosen by the training method of the naive Bayes classifier, two are not considered as ‘lever’ variables or observable variables (“AGE_AT_APPLICATION” and “PREV_RES_MTHS”) and will not be explored.

Table 3. Selected explanatory variables (there is no reason in (Boullé, 2006) to have two intervals for each variable; here it is pure chance)

Explanatory variables (each with Interval 1 or Group 1 and Interval 2 or Group 2):
RENT_BUY_CODE
PREV_RES_MTHS
CURR_RES_MTHS
B_ENQ_L6M_GR3
B_ENQ_L3M
B_ENQ_L12M_GR3
B_ENQ_L12M_GR2
AGE_AT_APPLICATION

Algorithm 1 has been applied to the 40,700 instances in the modeling data set. The ‘yes’ class of the target variable is chosen as the target class (C_z = ‘yes’). This class is very weakly represented (700 positive instances out of 40,700). The AUC values presented in Table 2 or on the challenge website do not show whether customers are classified as ‘yes’ by the classifier. In this case, exploration of lever variables does not allow a modification of the predicted class. Nevertheless, Table 4 and Figure 2 show that a large improvement of the ‘yes’ probability (the probability of cross-selling) is possible.
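The exploration step can be sketched in a few lines of Python for a naive Bayes model over discretized variables: for each lever or observable variable, every alternative interval b is tried and the one giving the highest probability of the target class is kept. This is a minimal sketch in the spirit of Algorithm 1, not a transcription of it; the data layout (`priors`, `cond`), the function names and the toy probabilities in the usage lines are assumptions made for illustration only.

```python
def posterior(priors, cond, x, target):
    """P(target | x) for a naive Bayes model over discretized variables.

    priors: {class: P(class)}
    cond:   {variable: {interval: {class: P(interval | class)}}}
    x:      {variable: interval currently observed}
    """
    joint = {}
    for c, p in priors.items():
        for var, interval in x.items():
            p *= cond[var][interval][c]
        joint[c] = p
    return joint[target] / sum(joint.values())


def explore(priors, cond, x, levers, target):
    """For each lever variable, find the interval maximising P(target | x, b).

    Returns {variable: (initial probability, best probability,
                        initial interval, best interval)}.
    """
    initial = posterior(priors, cond, x, target)
    results = {}
    for var in levers:
        best_b, best_p = x[var], initial
        for b in cond[var]:
            p = posterior(priors, cond, {**x, var: b}, target)
            if p > best_p:
                best_b, best_p = b, p
        results[var] = (initial, best_p, x[var], best_b)
    return results


# Toy model with invented probabilities (not the PAKDD 2007 estimates).
toy_priors = {"yes": 0.02, "no": 0.98}
toy_cond = {"B_ENQ_L3M": {"low": {"yes": 0.5, "no": 0.8},
                          "high": {"yes": 0.5, "no": 0.2}}}
toy_result = explore(toy_priors, toy_cond, {"B_ENQ_L3M": "low"},
                     ["B_ENQ_L3M"], "yes")
```

With a target class this rare, the best probability found can increase markedly while still remaining below the probability of the other class, which is the situation described above: the probability improves but the predicted class does not change.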

In Table 4, the second column (C2) presents the best P(C_z | x, b) obtained, the third column (C3) the corresponding initial P(C_z | x), the fourth column (C4) the initial interval used in the naive Bayes formulation (used to compute P(C_z | x)), and the last column (C5) the interval which gives the best improvement (used to compute P(C_z | x, b)). This table shows that:


Table 4. Best P(C_z = ‘yes’) obtained

Columns: C1 (explored variable), C2, C3, C4, C5

• for all lever or observable variables, there exists a value change that increases the posterior probability of occurrence of the target class;

• the variable that leads to the greatest probability improvement is B_ENQ_L3M (the number of Bureau Enquiries in the last 3 months), for a value in [1.5,+∞[ rather than in ]-∞,1.5[. This variable is an observable variable, not a lever variable, which means that a marketing campaign should focus on customers who contacted the bureau more than once in the last three months;

• nevertheless, none of those changes leads to a class change, as the obtained probability P(C_z | x, b) stays smaller than the probability of the competing class.


In Figure 2, the six dotted vertical axes represent the six lever or observable variables, as indicated on the top or bottom axis. On the left-hand side of each vertical axis, the distribution of P(C_z | x) is plotted (o), and on the right-hand side the distribution of P(C_z | x, b) is plotted (v). Probability values are indicated on the y-axis. In this figure only the best P(C_z | x, b) (PCa[0] in Algorithm 1) is plotted. This figure illustrates in more detail the same conclusions as given above.

Figure 2. Obtained results on P(C_z | x, b) (y-axis: probability; vertical axes include CURR_RES_MTHS, RENT_BUY_CODE and B_ENQ_L6M_GR3)


APPLICATION TO GOVERNMENT 
DATA: RESULTS FOR THE 
CONTRACEPTIVE METHOD 
CHOICE DATA SET 


Data and Experimental Conditions 


The Contraceptive Method Choice Data Set is available in the UCI Machine Learning Repository. This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. It consists of 1473 instances, corresponding to married women who were either not pregnant or did not know if they were at the time of the survey. The problem is to predict, from 9 explanatory variables (age, education, husband’s education, number of children ever born, religion, working or not, husband’s occupation, standard of living index, good media exposure or not), the type of contraceptive method used (no contraceptive method, short-term contraceptive method or long-term contraceptive method). Three explanatory variables are binary (religion either Islam or not, working or not, and good media exposure or not), two are numerical (age and number of children ever born) and the others are categorical.

The model used is a selective naive Bayes 
classifier (Boullé, 2007), trained on 75 percent 
of the dataset (1108 instances), the rest of the 
dataset being used for testing purposes. On the 
training subset, we obtained an AUC (Area Under 
ROC Curve) of 0.74, and an AUC of 0.73 for the 
test subset. 


Input Values Exploration 


The selective naive Bayes classifier (Boullé, 2007) uses 8 of the 9 explanatory variables, discarding the binary variable working or not. Among these variables, only two are chosen as lever variables: education and good media exposure or not. The other variables are not considered as possible targets for policies. Education is a categorical variable with four values from 1 (low education) to 4 (high education), partitioned into three groups by the classification algorithm: low education (value 1), middle education (values 2 or 3) and high education (value 4). Algorithm 1 has been applied to the 1473 instances. The target variable is in this case a three-class variable (no contraceptive, short-term contraceptive, and long-term contraceptive). As the proposed algorithm can only try to increase the probability of one class, it was applied twice, once to try to increase the probability of using a short-term contraceptive, once to try to increase the probability of using a long-term contraceptive.

Applying our method to increase the probability of using a long-term contraceptive showed that the most significant lever variable is the education level. Table 5 indicates the number of instances for each predicted class and each level of education. Out of 1473 instances, 577 instances are already at a high education level. Out of the remaining 895 instances, 99 were predicted to switch from no contraceptive to a long-term contraceptive if the education level was changed from whatever value (low or middle) to a high value, and 30 instances were predicted to switch from short-term contraceptive to long-term contraceptive with the same change in education level. Media exposure does not seem to have any significant impact (only 2 instances of ‘class changes’ to long-term contraceptive, by changing the media exposure to good media exposure).

Table 5. Number of instances for each predicted class and level of education (predicted contraceptive class by low, middle and high education)

When our method is applied to increase the probability of using a short-term contraceptive, 157 instances are predicted to switch from no contraceptive to short-term contraceptive with a higher education, and 18 with a change to good media exposure. This example illustrates the great importance of education level for the choice of contraceptive in developing countries.
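Counts such as “99 instances switch from no contraceptive to long-term contraceptive” can be obtained by comparing the predicted class of each instance before and after the lever variable is changed. The short Python sketch below shows one way to tally these transitions; the function names and the class labels are illustrative assumptions, and the probabilities are expected to come from whatever classifier is being explored.

```python
from collections import Counter


def predicted_class(probabilities):
    """Class with the highest probability in a {class: probability} dict."""
    return max(probabilities, key=probabilities.get)


def class_transitions(before, after):
    """Count (initial class, new class) pairs for instances whose class changed.

    before, after: sequences of predicted classes for the same instances,
    computed without and with the change of the lever variable.
    """
    return Counter((b, a) for b, a in zip(before, after) if b != a)
```

For example, `class_transitions(["none", "short"], ["long", "long"])` counts one transition from "none" to "long" and one from "short" to "long".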


CONCLUSION AND 
FUTURE TRENDS 


In this chapter we proposed a method to study the influence of the input values on the output scores of a probabilistic model. The method was first defined in a general case valid for any model, and then detailed for the naive Bayes classifier. We also demonstrated the importance of “lever variables”, explanatory variables which can conceivably be changed. Our method was first illustrated on the simple Titanic database in order to show the need to define lever variables. Then, on the PAKDD 2007 challenge database, a difficult cross-selling problem, the results obtained show that it is possible to create efficient indicators that could increase sales. Finally, we demonstrated the applicability of our method to a government data case, the choice of contraceptive for Indonesian women.

The case study presented on the Titanic dataset illustrates the point of applying the proposed method to accident research. It could be used, for example, to analyze road accidents or air accidents. In the case of air accidents, any new plane crash is thoroughly analyzed to improve the safety of air flights. Despite the increasing number of plane crashes, their relative frequency in relation to the volume of traffic is decreasing and air safety is globally improving. Analyzing the correlations between the occurrence of a crash and several explanatory variables could lead to a new approach to the prevention of plane crashes.

This type of relationship analysis method also has great potential for medical applications, in particular to analyze the link between vaccination and mortality. The estimated 50% reduction in overall mortality currently associated with influenza vaccination among the elderly is based on studies that neither fully take into account systematic differences between individuals who accept or decline vaccination nor encompass the entire general population. The method proposed in this paper could provide useful information to infectious disease research units. Another potential area of application is the analysis of the factors causing a disease, by investigating the link between the occurrence of the disease and the potential factors.

The proposed method is very simple but efficient. It is now implemented in an add-on of the Khiops software (see http://www.khiops.com), and its user guide is available at: http://perso.rd.francetelecom.fr/lemaire/understanding/Guide.pdf

This tool could be useful for companies or research centers that want to analyze classification results with input values exploration.

KEY TERMS AND DEFINITIONS


Classifier: A mapping from a (discrete or 
continuous) feature space X to a discrete set of 
labels Y. 

Probabilistic Classifier: A classifier with the 
probability of each label (class) as output. 

Exploration: Attempt to develop an initial, 
rough understanding of some phenomenon. 

Correlation: The strength and direction of a linear relationship between two variables.

Supervised Learning: A technique for learning a function (a mapping) from training data.

Variable Importance: Measure of the importance of a variable for the output of a classifier.

Sensitivity Analysis: Analysis of the influence of a change in an input variable on the output of the classifier.


ENDNOTE 


1 The description of the proposed method is done only for classification problems, but the method is easily adaptable to regression problems.

ABSTRACT


In large-scale retail trade, a very significant problem consists in analyzing the response of clients to 
product promotions. The aim of the project described in this work is the extraction of forecasting models 
able to estimate the volume of sales involving a product under promotion, together with a prediction of 
the risk of out of stock events, in which case the sales forecast should be considered potentially underes- 
timated. Our approach consists in developing a multi-class classifier with ordinal classes (lower classes 
represent smaller numbers of items sold) as opposed to more traditional approaches that translate the 
problem to a binary-class classification. In order to do that, a proper discretization of sales values is 
studied, and ad hoc quality measures are provided in order to evaluate the accuracy of forecast models 
taking into consideration the order of classes. Finally, an overall system for end users is sketched, where 
the forecasting functionalities are organized in an integrated dashboard. 


INTRODUCTION 


Important business decisions and organization require a scientific framework for the systematic analysis of alternatives, as recognized since Taylor’s classic work “The Principles of Scientific Management” (Taylor, 1911), which essentially marks the beginning of the Decision Science field.

A fundamental task in decision science is forecasting, involved in most decision making processes, sometimes even at an unconscious level. The basic idea is that the known history of the market (global or limited to a single company organization) can help to induce reasonable guesses of the effects of an action, therefore providing valuable support in evaluating the several alternatives business managers typically have to sift through, and in choosing the most promising one.

Forecasting in the business context, and sales forecasting in particular, can be studied and applied at several different levels:

• in global market analysis, also called the macro level, where offers and demands are studied in the general context of the global market without a specific focus on single products or services;

• in sector- or product-specific market analysis, also called the micro level, where the above analysis is focused on single products/services or families of them;

• in within-company market analysis, where the focus is on ascertaining the health status of the company activities, mainly from an evolutionary perspective that makes it possible to recognize trends and possible weak points (actual or future), and on evaluating the future overall effects of actions to be taken within the company;

• in within-company product-specific sales analysis, where a single product is put under the lens of a microscope and analyzed in detail, highlighting its performance and its reactions to various kinds of intra-company stimuli (e.g., promotions or change of exposition level) and external ones (e.g., the introduction in the market of a new competitor product).


The aim of this work is to analyze a real case study in the latter context, focusing on the effects of promotions on the sales of a single product, mainly aimed at optimizing its stocking. The closed-world context of such analysis, on the one hand, simplifies the forecasting problem by omitting external factors that are difficult to handle and that, in some cases, might have a large uncertainty; on the other hand, it makes it possible to work with a complete and detailed history of previous sales and precise measures directly collected by the company, and even permits (to an economically limited extent) the empirical evaluation of models and strategies on the ground, by replicating situations and later measuring the effects.

This chapter contributes to two main topics covered by this book, namely the role and impact of data mining in the management and in the decision making tasks within private organizations. In particular, the chapter provides a view of the methodological aspects of the problem, the technical solutions adopted, and empirical results in the field.

Coop Italia is today the largest holding of large retailers and the largest organization of consumers in Italy. Overall, Coop Italia includes 163 consumer cooperatives with approximately 1261 stores, 6 million members and over 52,000 employees. As part of this large organization, Unicoop Tirreno is a major organized distribution company that is present in Tuscany, Lazio, Umbria and Campania with 112 stores (in 3 different store sizes), more than 770,000 members and approximately 6,300 employees. In 2007 it exceeded 1.16 billion euro of total sales.

In this context Unicoop Tirreno decided to develop Business Intelligence solutions, reactive to market changes, and started the project Business Intelligence and Data Warehouse (BI-Coop). The objectives of the BI-Coop project can be summarized as follows: (1) to create and populate a data warehouse from the operational data and to create interactive data reports; (2) to develop forecasting models through the use of data mining technologies. In particular, data mining is used to predict customer defection and to forecast promotion sales.

In this paper we describe the methodology and results obtained on models for promotion sales. This forecasting task can use only promotion features and sales data over the recent past. Moreover, in order to address this problem, we need to consider a side effect of promotions: the so-called “out of stock” phenomenon, i.e. the event in which a store runs out of products to sell before the promotion is finished, a signal of an incorrect storage estimate and a cause of lost income. Out of stocks (OOS) are not currently tracked in the operational database, making it difficult to quantify the extent of the phenomenon.

The forecast models described in this work 
have been developed using SPSS Clementine 
(Clementine, 2009), and they will be used by Coop 
in marketing planning to optimize the storage of 
goods in shops and to develop new promotions. 


BACKGROUND 


Forecasting is the process of estimation in un- 
known situations and Sales Forecasting is used 
in the practice of Customer Demand Planning in 
everyday business forecasting for manufactur- 
ing companies. It is a complex problem that has 
to take into account many different factors, not 
always easy to model and to control (new com- 
petitors, economic changes, and social events). 
Many approaches were developed to capture all 
these aspects, from statistical to machine learning 
ones, to develop decision support systems aimed 
to help managers. 

Data mining identifies trends within data that go beyond simple data analysis, through the use of sophisticated algorithms: it could be described as the process of extracting hidden patterns from data. One of the first examples of data mining applied to retail systems was market basket analysis, used to understand the purchase behavior of groups of customers and to exploit it to increase sales and to support cross-selling, store design, discount plans and promotions. Developing data mining techniques for marketing is an ongoing task, as recent books confirm (Berry, 2004; Ohsawa, 2009).

In this context, predictive data mining models for sales data are classification models. Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc.) and based on a training set of previously labeled items.

Promotion analysis cannot be separated from the OOS phenomenon. In our scenario, OOS is the event in which a store runs out of products to sell before the promotion is finished. In some cases this event is referred to as Out Of Shelf, if it concerns only shelf availability analysis. However, the corporate data warehouse does not have data about the quantity of goods in storage, so the occurrence of the OOS event has been derived from an analysis of sales data. The definition was made in conjunction with experts, essentially by identifying sharp decreases in sales quantity over each promotion day.
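The idea of deriving OOS events from sales alone can be illustrated with a simple rule: a promotion day is flagged when its sales fall well below the average of the preceding days. The Python sketch below is only one possible reading of “sharp decrease”; the function name, the `drop_ratio` threshold and the `min_history` parameter are hypothetical and are not the rule agreed with the domain experts.

```python
def suspected_oos_days(daily_sales, drop_ratio=0.3, min_history=2):
    """Return the indices of promotion days whose sales drop sharply.

    A day is flagged when its sales fall below `drop_ratio` times the mean
    of the preceding non-flagged days. Both parameters are illustrative
    assumptions, not values taken from the study.
    """
    flagged = []
    history = []
    for day, quantity in enumerate(daily_sales):
        if len(history) >= min_history:
            baseline = sum(history) / len(history)
            if quantity < drop_ratio * baseline:
                flagged.append(day)
                continue  # a suspected OOS day must not lower the baseline
        history.append(quantity)
    return flagged
```

Excluding flagged days from the baseline keeps a run of consecutive out-of-stock days from being mistaken for a new, lower level of normal sales.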


Related Works 
Sales Forecasting 


There is a wide literature on sales forecasting. However, most approaches focus on time series analysis and prediction (for a survey, see for instance (Arsham, 1994)), which present two big drawbacks in our context:

• time series analysis is based on the extraction of trends and other behavioral models that are then matched to the current situation to forecast future values, implicitly assuming that a model that captured the past behavior of the system is applicable to the present situation. However, in our context, we aim to predict what happens in response to an external event, a promotion, that naturally creates a discontinuity with the past behavior of the series, therefore compromising the above-mentioned regular evolution assumption;

• time series represent only part of the information we need to handle. In particular, in addition to the sales history of the promoted product, each promotion has its own characteristics which include, among others, the promotion type, its size, its duration, and various descriptors of the conditions under which the promotion takes place. An approach purely based on time series might not be able to take advantage of this important knowledge.


Other complex approaches, like neural networks and regression-based models (linear and non-linear), can be appealing in terms of potential accuracy, yet they usually yield models that are hard to inspect and interpret. Conversely, in practice, the domain expert (who also plays the role of end user) requires that the resulting models can be understood, and possibly also amended, and therefore simplicity is a must. Moreover, Coop has a twofold requirement about the model: on the one hand, they wish to be able to quickly modify it in order to satisfy marketing department requests, and on the other hand, they wish to be able to quickly recalculate it in order to consider new trends.

The most natural candidate satisfying all the requirements listed above is classification by means of decision trees, for instance computed by the standard C4.5 algorithm (Quinlan, 1992) or its variants. In particular, decision trees make it possible to:

• obtain a simple-to-read model, suitable for interaction with the domain expert analyst;

• take as input both the history of sales (for instance in the form of monthly, quarterly and yearly aggregates) and the context descriptors, whatever their nature (numerical, categorical);

• provide as output a set of classes, representing the sales bands that are forecast by the model. Determining the optimal number and size of such bands is a problem that is also tackled in this work.


Multi-Class Classification
with Ordinal Classes


In this scenario, from the marketing department point of view, the goal is to predict a meaningful interval of sales, not just a number. Sales, in fact, are also influenced by largely unpredictable factors, such as social events, weather, traffic, and so on, making precise numeric predictions not meaningful. Hence, we have decided to discretize the objective function (sales amount) into a set of classes.

This step is critical and needs a good trade-off among:

• a low number of classes (useful for classification algorithms and also for easy model evaluation);

• significance of the predicted value compared to storage choices;

• a distribution as uniform as possible between classes.


The particular contribution of our approach was to work directly with multi-class ordinal classifiers. In addition, the measures generally used, accuracy and standard deviation, do not perfectly fit our problem. Therefore, a specific measure for classifiers with ordinal classes is defined.

An ordinal quantity differs from a nominal one because it exhibits an order among the different values it can assume. An ordinal attribute could be, for example, a temperature measure represented by the values Hot, Mild and Cool. It is clear that there is an order among those values: Hot > Mild > Cool. Standard classification algorithms map a set of attribute values to a categorical target value. These algorithms are generally unable to use ordering information during the classification process and treat an ordinal target class attribute like a nominal class. However, some information is lost when this is done, information that can potentially improve the predictive performance of a classifier.

Real circumstances frequently involve situations exhibiting an order among the different categories represented by the class attribute. There are many statistical approaches to this problem, but they are generally based on specific distributional assumptions for the class values (Herbrich, Graepel, & Obermayer, 1999). In recent years, different approaches for ordinal classification have been proposed, based on decision trees (Potharst & Bioch, 2000; Frank & Hall, 2001), regression (Herbrich, Graepel, & Obermayer, 1999; Lin & Li, 2007; Rennie & Srebro, 2005), regression trees (Kramer et al., 2001), boosting (Freund, 2003), and decision rules (Dembczyński, Kotłowski, & Słowiński, 2007). In each of these cases, the proposed algorithms or methodologies improve ordinary results by a marginal-to-moderate amount, usually not greater than a 5% gain in accuracy. Moreover, this is obtained at the price of implementation costs and a loss in flexibility or comprehensibility of results. In our applicative experience this kind of improvement is neither critical nor very significant for the end user. Therefore, in this work we choose to use the well-known C4.5 algorithm (Quinlan, 1992), and define new measures of accuracy for evaluating the output models.
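To see why plain accuracy is a poor fit for ordinal classes, consider two classifiers with the same accuracy, one of which always misses by a single class while the other misses by two. The Python sketch below computes three generic scores that make the difference visible. These are common order-aware measures chosen for illustration, not the specific measure defined in this chapter; the function and key names are assumptions.

```python
def ordinal_scores(true_classes, predicted_classes, order):
    """Accuracy and two order-aware scores for ordinal class labels.

    order: list of class labels from lowest to highest.
    """
    rank = {label: i for i, label in enumerate(order)}
    distances = [abs(rank[t] - rank[p])
                 for t, p in zip(true_classes, predicted_classes)]
    n = len(distances)
    return {
        "accuracy": sum(d == 0 for d in distances) / n,      # exact hits
        "within_one": sum(d <= 1 for d in distances) / n,    # off by at most one class
        "mean_distance": sum(distances) / n,                 # average class distance
    }
```

A nominal measure treats predicting Hot for a Cool day and predicting Mild for a Cool day as equally wrong, whereas `within_one` and `mean_distance` separate the two cases.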


Out of Stock 


The stock-out problem has been investigated in the area of Inventory Management for over 30 years and several models have been presented (Hopp, 2000; see W. Hopp and M. Spearman, Factory Physics (International edition), McGraw Hill, 2000). On the other hand, the problem is mainly discussed in the marketing literature from the consumer reaction perspective (Campo, 2000; Emmelhainz, 1991). The focus of our investigation is the automatic detection of OOS events using rule-based approaches. To the best of our knowledge, the only paper we can compare with is (Papakiriakopoulos, 2009).


The most important difference is that our approach is completely automatic: in fact, we define a model of the OOS event using only sales data. Moreover, we extract data from the data warehouse: this results in a quick and efficient process that makes it possible to refine experiments at low time and resource costs. The results of the two analyses cannot be compared: their data consider all the goods existing in the stores, whereas our data are focused only on promotional items.


DATA MINING ON 
PROMOTIONAL SALES 


Data Exploration 


The analysis focused on promotional sales of stores, using only promotions on food products. A first distribution analysis of sales volumes, through a discretization in 25 equidistant (i.e., equal-size) bins, exhibits the following behavior: 20.53% of the promotions in a single store sold between 0 and 24 items, 12.89% sold between 25 and 49 items, and so on. The distribution shows a large number of promotions with a low volume of sales: in fact, over 50% of the promoted products sold fewer than 125 pieces. This is an entirely unexpected result, since these sales are calculated over a promotion period of 15 days and for a large store.
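The kind of distribution analysis described above amounts to counting promotions per equal-width bin and expressing the counts as percentages. A minimal Python sketch follows, assuming integer item counts and a bin width of 25 as in the text; the function name is hypothetical and no upper cap on the last bin is applied.

```python
from collections import Counter


def bin_shares(values, width=25):
    """Percentage of values in each equal-width bin.

    Bins are [0, width-1], [width, 2*width-1], ... for integer item counts;
    the result maps (lower bound, upper bound) to a percentage.
    """
    counts = Counter(v // width for v in values)
    n = len(values)
    return {(k * width, (k + 1) * width - 1): 100.0 * c / n
            for k, c in sorted(counts.items())}
```

Summing the shares of the first five bins gives the proportion of promotions that sold fewer than 125 pieces.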

A deeper analysis shows that there are many products with zero sales. Regardless of the possible reasons (incomplete population of the database, promotions never started in some stores), for the purposes of our analysis it was decided to disregard promotions with fewer than 5 pieces sold. These promotions have been eliminated from the tables, obtaining the distribution in Figure 1.

The classification models were built as ad hoc models for each store: this solution requires developing more models, but it produces forecasts that are closer to the actual sales of each store, which is the real interest of Coop.

Figure 1. Sales volume distribution of promotions with at least 5 items sold

Figure 2. Structure of relevant portion of the data warehouse


Data Preprocessing Issues 


The starting point for this work was the use of a corporate data warehouse: among the several tables available, the most interesting for our purpose was the sales data table, characterized by a very large number of lines (926,774,117), since each line of each cash receipt is a record. The tables of interest are 6 and are structured as in Figure 2 (among them Promo, PromoDetail and PromoMechanics). The selected information includes the promotions, their type (or mechanics) and details, the goods involved, their position in the product classification taxonomy and, finally, the stores where promotions are performed.


Mining Table 


Creating the mining table, i.e., the table that col- 
lects the information used in the model mining 
phase needs to define the level of detail of the 
individual records and then determine the ap- 
propriate data aggregation. We have evaluated 
different strategies: the final choice was for one 
row for each promo detail. A promo detail identi- 
fies an item that belongs to a promotion ina store. 
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Table 1. Derived attributes and their description 


Field name
Sold_Art_3_1
Sold_Seg_3_1
Sold_Art_1_0
Sold_Seg_1_0
Days_Promotion


In this way we obtain a detailed response to the promotion for every article in all the different shops where the same promotion was active.

We used sales data covering 16 months in 134 stores (522,541,764 records). The data were aggregated into 4 time slots, collecting the total sales information of an item for each store in a single row of the table. This aggregation yields a mining table of 240,059 rows. The fields of the table are the union of the fields present in the 6 separate tables shown in Figure 2, for a total of 62 fields, with the addition of 5 calculated fields that are described in Table 1.

These attributes are meant to capture the recent variations in sales from two different perspectives: the single good under promotion and the whole segment it belongs to. The latter, in particular, allows taking into consideration possible sales fluctuations that involve the whole market segment. Moreover, the history of sales is divided into two time slots, in order to detect general trends. The target variables used to train the models are: (1) the sales amount of the promoted item and (2) the number of out-of-stock (OOS) events that occurred during the promotion.


Discretization 


The target variable is continuous, so a discretization is necessary in order to use many classification algorithms. Moreover, it is strongly skewed towards low values and the distribution of sales volume is very sparse: the values range between 0 and 105,650, yet 80% of promotions sold fewer than 500 items.


[Table residue: column headers "Description" and "Population within first 3 classes"]


A first attempt to discretize the target variable used equal-size (equal-width) binning, which produces too many classes. Even increasing the size of the bins, the number of classes remains large (unless we adopt extremely large intervals) and the data are heavily unbalanced, which leads to a decrease in prediction accuracy. Some examples of discretization are given in Table 2.

Classification algorithms are unable to generate high-accuracy models over such a large number of classes. As an alternative, we tried an equal-frequency binning, but without satisfactory results. For instance, using 20 bins the result is not particularly significant: it is of little interest for market analysis to know whether a product will sell between 6 and 14 items or between 14 and 23, and knowing only that it will sell more than 1600 items is of very little use.
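The contrast between the two binning strategies can be sketched in a few lines of Python. This is only an illustration: the sample values and the helper names `equal_width_bins` and `equal_frequency_edges` are assumptions introduced here, not data or code from the study, and the sketch assumes at least as many values as bins.

```python
from bisect import bisect_left
from collections import Counter
from typing import List, Sequence


def equal_width_bins(values: Sequence[int], width: int) -> List[int]:
    """Equal-size binning: the bin index is the integer part of value / width."""
    return [v // width for v in values]


def equal_frequency_edges(values: Sequence[int], n_bins: int) -> List[int]:
    """Upper edges of n_bins bins holding (roughly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins - 1] for k in range(1, n_bins + 1)]


# Hypothetical, strongly skewed sales volumes (illustrative values only).
sales = [3, 6, 8, 14, 20, 23, 40, 75, 150, 480, 2600, 105650]

# Equal-size bins of 500 items: almost everything lands in bin 0, while the
# few large values push the highest bin index up to 211.
width_bins = equal_width_bins(sales, 500)
width_counts = Counter(width_bins)

# Equal-frequency binning with 4 bins: balanced classes, but very narrow
# low-value intervals and one huge top interval.
edges = equal_frequency_edges(sales, 4)
freq_counts = Counter(bisect_left(edges, v) for v in sales)

print(width_counts)
print(edges, freq_counts)
```

On this toy sample the equal-size scheme is dominated by a single class, while the equal-frequency scheme is balanced but separates ranges (such as 9 to 23 items) that carry little business meaning, which mirrors the two problems described above.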

We manually refined the discretization, trying to satisfy the following requirements:

• Low number of classes (maximum 20)
• Significance of the predicted value with respect to subsequent storage choices
• Uniform distribution among classes

Figure 3. Discretization through equal-frequency binning (EFB, top) and its manual refinement (bottom)


The chosen discretization, which will be used for the construction of the forecast model, is shown in Figure 3 compared to the result of an equal-frequency binning.

Each bar represents the whole dataset, divided into 20 bins. The horizontal axis is in logarithmic scale, to better emphasize the lower-value bins. As we can see, the refined discretization “moves” the bins towards higher values, thus providing a more detailed division for middle-high sales values. The same refined discretization is shown in Figure 4 as a bar plot.


Predicting the Percentage Variation 
of Sales 


In this work, we developed two separate models that differ in the variable to predict: intervals of sales volume, which use the discretization previously defined; and response to the promotion, which gives the percentage change in sales with respect to the previous time period. In both cases we are dealing with the creation of multi-class classifiers with ordinal classes. Both classifiers, aimed respectively at predicting the sales volume and its percentage variation w.r.t. the recent past, have their pros and cons.

In the first case, it is easy to find a satisfactory granularity of the classification classes but, on the other hand, the resulting classes are strongly unbalanced. In the second case the opposite happens, yielding balanced classes that, however, might gather largely different absolute values. A comparison of the two models was nevertheless performed, aimed at measuring the level of coherence of their predictions. The results can be summarized as follows: almost half the records yield coherent results (i.e., they show a null distance, so the two models make the same forecast), and moreover the distance distribution decreases very quickly (the percentage is under 5% from class distance 5 onwards), showing that the two classifiers are strongly coherent. For this reason, we only report on the second of the two approaches.

In order to cope with the unbalance of the sales data, we defined a new target function, aimed at forecasting the percentage variation of sales during the promotion w.r.t. sales in the 15 days that preceded the promotion. This variation effectively expresses the real response of customers to the promotions.
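As a minimal sketch of this target function, the percentage variation can be computed as below. The function name `percentage_variation` is hypothetical, and the sketch assumes that both totals refer to periods of equal length (15 days), so no per-day normalization is applied; it returns `None` when nothing was sold in the reference period, since such promotions are excluded from the analysis.

```python
from typing import Optional


def percentage_variation(promo_sales: float, prior_sales: float) -> Optional[float]:
    """Percentage change of sales during the promotion w.r.t. the preceding period.

    Returns None when the article was not sold at all in the preceding period,
    because the ratio is undefined in that case.
    """
    if prior_sales <= 0:
        return None
    return 100.0 * (promo_sales - prior_sales) / prior_sales


# Illustrative values only (not taken from the study).
print(percentage_variation(120, 100))   # a +20% variation
print(percentage_variation(2600, 100))  # a +2500% variation
print(percentage_variation(10, 0))      # excluded: no sales before the promotion
```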

A first rough summary of the distribution of sales variation during promotions shows a largely significant increase: for 91% of goods the percentage variation of sales under promotion is larger than 20%. (Promotions where the promoted article was not sold at all during the preceding 15 days were not considered in the analysis.) Further explorations of the data led to the definition of a set of percentage variation intervals that reaches a trade-off among the following common properties:


• yield precise information about the sales volume of each article, by means of small intervals;
• adopt a small number of classes, in order to ease the task of classification algorithms and to make the resulting predictive models easier to evaluate and to understand;
• obtain a class distribution as even as possible.

The discretized classes adopted in the subsequent model extraction phases were the result of data inspection, consideration of the algorithms to be used, and the indications and practical requirements of the domain experts. The result consists of ten intervals of percentage sales variations, listed in Table 3.

Table 3. Classes for percentage variations of sales (only two of the ten rows are legible in the source)

Small increase 1 (variation between +20% and +100%)
Extreme increase 2 (variation > 2500%)

Using the classes defined above, the distribution is the one shown in Figure 5.

Figure 5. Distribution of promotions along the percentage variation classes defined in Table 3

The figure shows peaks on classes 7 and 10, indicating that several promotions lead to a large or extreme increase of sales. A representative insight is depicted in Figure 6, which represents the percentage coverage of the variation classes by the three sectors of the food category.

This graph highlights some interesting facts. For instance, we can infer that promotions for the very fresh sector (highly perishable products) have a relatively low response, since they are mostly concentrated in the first 4 classes, where sales show only a small increase or even a drop. The fresh sector (fresh products) behaves in the opposite way: its promotions tend to fall in higher classes, in some cases reaching the same coverage as the grocery sector (which is larger in terms of promotions), e.g., in class 9.

Figure 7. Distribution of class prediction error

Following a standard procedure, the input dataset was randomly partitioned into a training set, containing 70% of the available promotions, and a test set, containing the remaining 30%. The records of the former are used to build the classifier, while the records of the latter are used exclusively to validate it.
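A random 70/30 partition of this kind can be sketched as follows. The function name `split_train_test` and the fixed seed are assumptions made here for reproducibility of the example; the source does not state how the random partition was actually produced.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(records: Sequence[T],
                     train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder promotion identifiers (illustrative only).
train, test = split_train_test(range(10))
print(len(train), len(test))
```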

The classification model was extracted by 
means of C4.5, a standard decision tree construc- 
tion algorithm, applied with several different 
parameter settings, including various pruning 
strengths, minimum number of records that fall 
in each leaf of the tree, usage of boosting tech- 
niques, etc. The best performing model, which is 
described in the following, had moderate settings: 
a small pruning factor (45) and a small minimum 
population for each leaf (3). The output model 
was produced without any kind of boosting, in 
the form of classification rules. 

Accuracy reaches 49.99% on the training set and 32.67% on the test set. Such apparently low percentages are much larger than what a purely random classifier can achieve, given the relatively high number of classes (10): a uniform random guess over 10 classes would be correct about 10% of the time. The confusion matrix has a strongly diagonal structure that attests to the good quality of the output model. Anticipating the more exhaustive discussion provided in Section 5, a useful summary of the confusion matrix is given in Figure 7, where the corresponding distribution of class prediction errors (i.e., the distances between predicted and actual classes) is plotted. As we can see, the distribution quickly drops after the first values; indeed, ca. 76% of the records fall in the first three groups, i.e., 76% of predictions have an error equal to or smaller than 2.


Classification Rules 


Inspecting the rules that constitute the classification model obtained above, we can highlight some behaviour generally common to all shops. In particular, Table 4 provides a sample of such rules, also characterized by:

• support: number of cases where the rule applied
• confidence: accuracy of the rule

Table 4. Classification rules with support and confidence, including limited tolerance to errors

Columns: Rule | Support | Confidence | Confidence with error ≤1 | Confidence with error ≤2

Rule 1: if CATEGORIA = ZUCCHERO E DOLCIFICANTI and FL_VOLANTINO = No and VEND_ART_1_0 > 37 then class = 2
Rule 2: if CATEGORIA = ALIMENTI INFANZIA and VEND_ART_1_0 > 275 then class = 3
Rule 3: if CATEGORIA = CONSERVE DI FRUTTA and MESE = 8 then class = 5
Rule 4: if CATEGORIA = YOGURT and DESCRIZ. = TAGLIO PREZZO and MESE = 9 and VEND_ART_1_0 > 54 and VEND_SEG_1_0 <= 4487 then class = 6
Rule 5: if CATEGORIA = PASTA FRESCA and MESE = 10 and VEND_ART_1_0 > 51 then class = 7
Rule 6: if FL_COOP = Si and CATEGORIA = BISCOTTI and FL_VOLANTINO = Si and VEND_ART_1_0 <= 275 then class = 8

[Of the numeric columns, only the values 93%, 82%, 78% and 78% are legible in the source; their assignment to rules is not recoverable.]

Besides the basic confidence of the rules, Table 4 reports the confidence values obtained when a prediction error not larger than 1 (column 4) or 2 (column 5) is tolerated. In general, we can see that such a small error tolerance increases the confidence considerably. Some of the rules above (which use the original names, in Italian) can be explained as follows:

• Rule 1: if more than 37 articles were sold in the last month before the promotion (VEND_ART_1_0 > 37) in the category “sugar” (CATEGORIA = ZUCCHERO E DOLCIFICANTI), and the promotion was not advertised in the advertising leaflets, the promoted item will sell the same or just a slightly higher amount than before the promotion (class = 2).
• Rule 2: similar to Rule 1, but for food for children and with a higher threshold (275) and a slightly higher gain in sales (class = 3).
• Rule 6: biscuits of the Coop brand that sold less than 25 pieces in the last month before the promotion and that were advertised in the leaflets will dramatically increase their sales (class = 8, corresponding to a gain up to 1500%).


EVALUATION OF MULTI-CLASS CLASSIFIERS


Problem Definition


Evaluating the quality of the classifiers generated so far by means of a synthetic measure is a challenging problem. Indeed, most of them are defined over several ordinal classes, that is to say, there is a natural order between classes that should be considered in evaluating predictions. This is a direct consequence of the fact that such classes were obtained through discretization of a continuous variable.

Accuracy, the most commonly used quality measure in model evaluation, only considers perfect matches between predicted and actual classes, counting all other cases simply as generic misclassification errors. In contrast, our context suggests that the distance between the predicted class and the actual one should be part of the evaluation function, thus considering trade-offs between perfect matches and perfect mismatches. In the literature on the validation of classification models, there is a substantial lack of quality measures that address these problems. Therefore, in the following we discuss some approaches that provide more precise evaluations of the quality of multi-class models with ordinal classes.


Distance from the Diagonal 


The basic step, already mentioned in the previous sections, consists in computing the distance between the predicted class and the actual class of each promotion in the dataset. Then, we plot the distribution of these values, which provides a means for easily checking the diagonality of the confusion matrix, and therefore for obtaining a first qualitative assessment of the model under analysis.
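The distance distribution described above can be computed with a short sketch like the following. The function name `error_distance_distribution` and the sample labels are assumptions introduced for illustration; classes are assumed to be encoded as consecutive integers, so that the absolute difference of two labels is their distance.

```python
from collections import Counter
from typing import List, Sequence


def error_distance_distribution(actual: Sequence[int],
                                predicted: Sequence[int],
                                n_classes: int) -> List[int]:
    """freq[d] = number of records whose predicted class is d classes away
    from the actual class (the direction of the error is ignored)."""
    counts = Counter(abs(a - p) for a, p in zip(actual, predicted))
    return [counts.get(d, 0) for d in range(n_classes)]


# Hypothetical class labels in 1..10 (illustrative values only).
actual = [1, 2, 3, 4, 5, 6]
predicted = [1, 2, 4, 4, 7, 2]

freq = error_distance_distribution(actual, predicted, n_classes=10)
within_two = sum(freq[:3]) / sum(freq)  # share of records with error <= 2
print(freq, within_two)
```

Plotting `freq` as a bar chart gives a distribution of the kind shown in Figure 7, and `within_two` corresponds to the share of predictions with an error of at most two classes.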


Example 1 


In Figure 8 three sample distributions are plotted, 
corresponding to three fictitious classification 
models. In the following we present an analysis 
of how these distributions can help to infer the 
quality of the corresponding classifiers. 

With classifier (a) a large number of records fall in the first categories, corresponding to low distance values, and the distribution quickly drops for larger distances. That means that most records are correctly or quasi-correctly classified, and only a small number of them are associated with classes that are very different from the correct one. Therefore, (a) appears to be a good classifier.

Classifier (b) was built in such a way that its value for the traditional accuracy measure is the same as that of classifier (a). Indeed, the leftmost bars of both plots, corresponding to perfectly classified elements, have the same height. However, it is clear that classifier (a) should be preferred to (b), since the latter has an almost uniform error distribution for errors greater than zero, meaning that small and large errors are equally probable.

Finally, classifier (c) presents the same problems as (b) and, moreover, the peak on the zero-error bar is missing, meaning that the classifier guarantees neither a good accuracy (in the standard sense) nor a good percentage of quasi-correct classifications. This is a clear case of a bad classifier.


Quantitative Measures 


In order to provide an objective means for evaluating models or for comparing two of them, a quantitative measure should be defined that revises the traditional notion of accuracy. This measure has to take into consideration the overall distribution of class errors. In this section we provide two improvements of the standard definition of accuracy, which essentially compute an aggregate count of the errors of the model, weighted on the basis of the gravity of each error.


Weights Vector-Based Approach 


In this approach, we assume the user provides a vector weights of N positive (possibly null) values, where N is the number of classes in the classification problem, such that each value weights[i] (i=0,..., N-1) represents the weight associated with errors of size i. For instance, weights[3] represents the weight associated with errors where the predicted class has a displacement of 3 classes w.r.t. the correct one, whichever the direction of the displacement. Then, a vector freq of N elements is computed, each value freq[i] (i=0,..., N-1) representing the number of promotions whose predicted class has distance i from the correct class. Then, we define the vector-based accuracy of the model as follows:

$$\mathit{Accuracy}^{vector} = \frac{\sum_{i=0}^{N-1} \left( freq[i] \cdot weights[i] \right)}{\sum_{i=0}^{N-1} freq[i]}$$

In principle, the values of weights should fall in the range [0, 1], in order to preserve the statistical meaning of accuracy; however, small negative values might be used to penalize some particular errors. Table 5 reports three examples of weight vectors, each putting a different emphasis on the different types of errors.

Figure 8. Distribution of class error for three sample classifiers

In Weights 1, only the records that are correctly classified are considered (distance zero). In this case Accuracy^{vector} = Accuracy, since the computed aggregate has the same value as the traditional accuracy. Weights 2 also considers as partially correct the records that are classified close to the correct class: while a perfect prediction has weight 1, quasi-correct ones have a smaller weight. Weights 3 differs from the previous one in that negative weights are associated with predictions that are very distant from the correct class, which strongly penalizes large errors.
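The vector-based accuracy is a weighted average of the error-distance counts, and can be sketched as below. The three weight vectors and the `freq` values are hypothetical examples built to follow the description above (exact matches only, partial credit for close predictions, negative weights for distant ones); they are not the vectors of Table 5.

```python
from typing import Sequence


def vector_accuracy(freq: Sequence[float], weights: Sequence[float]) -> float:
    """Sum of freq[i] * weights[i] divided by the total number of records."""
    if len(freq) != len(weights):
        raise ValueError("freq and weights must have one entry per distance")
    return sum(f * w for f, w in zip(freq, weights)) / sum(freq)


# Hypothetical error-distance counts for 100 records over 10 classes.
freq = [40, 25, 15, 10, 5, 3, 2, 0, 0, 0]

# Hypothetical weight vectors (illustrative values only).
weights_1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]            # exact matches only
weights_2 = [1, 0.5, 0.25, 0, 0, 0, 0, 0, 0, 0]       # partial credit
weights_3 = [1, 0.5, 0.25, 0, 0, -0.5, -0.5, -1, -1, -1]  # penalize large errors

for w in (weights_1, weights_2, weights_3):
    print(vector_accuracy(freq, w))
```

With the first vector the measure coincides with traditional accuracy (40 correct records out of 100); the second rewards near misses, and the third lowers the score again because of the few records with a distance of 5 or more.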

The effects of these three different weight settings are reported in Table 6.

Table 5. Three sample weight vectors over 10 classes

Table 6. Vector-based accuracy over the distributions of Figure 8, with the weights of Table 5

 | Distribution (a) | Distribution (b) | Distribution (c)
Weights 1 | 37.5% | 37.5% | 15.0%
Weights 2 | 65.7% | 47.6% | 33.7%
Weights 3 | 64.8% | 40.9% | 31.1%

The values obtained are coherent with the qualitative results discussed above:

• classifier (c) has a value of Accuracy^{vector} smaller than classifiers (a) and (b), independently of the weights adopted;
• classifiers (a) and (b) have the same value of standard accuracy (equal to Accuracy^{vector} computed with Weights 1), yet they differ significantly on the new accuracy measure. In particular, classifier (a) performs better than classifier (b), especially with Weights 3, which penalizes the significant amount of large errors yielded by classifier (b).


Limits of the Vector-Based Approach


The simplicity of the vector-based model causes a few drawbacks that might limit its usefulness in some contexts:

• there is no distinction between approximations by excess and by defect. In some contexts it can be useful to stress the importance of one error against the other. For instance, in the case of stocking of articles, it might be convenient to prefer overestimates, which correspond to the risk of having stock on hand after the end of a promotion, over underestimates, which correspond to the risk of getting OOS during the promotion.
• the weight of an error does not directly depend on the actual class to be predicted. Indeed, only the distance between the predicted class and the actual one is considered. In some cases, it might be useful to discriminate also w.r.t. the actual class. For instance, when the classes (as in our classification approach) are obtained through a discretization process that can yield discretization intervals of highly variable width, the gravity of an error could depend on such interval width, penalizing errors made around larger bins.


Empirical experimentation in the field tells us that Accuracy^{vector} yields a precise evaluation of the model quality in most practical cases. However, when the above-mentioned limitations play too strong a role to be neglected, a generalization of the approach can be followed, which is described later in this section.


Vector-Based Accuracy of 
the Generated Models 


Following the procedure described above, we choose vectors of weights in order to decide how important each classification mistake should be. In our experiments, we selected two vectors, shown in Table 7, that mainly differ in the severity of the penalties for errors of large extent. Both vectors assign an almost linearly decreasing weight to errors from zero to 3. However, while in the first vector all other cases have a null weight, in the second one larger errors receive further penalties.

Table 7. Weights for vector-based accuracy of the percentage variation model

Both vector-based accuracies are relatively high, therefore attesting the good quality of the model extracted. We can observe that the second measure, which penalizes large errors, yields a drop w.r.t. the first one, since it gives more importance to errors of large extent. The performance is nevertheless stable: the accuracy over Weights 2 is very close to the one obtained over Weights 1, meaning that there were not many large errors to be penalized, and therefore the predicted classes are generally close to the true ones.

The corresponding vector-based accuracies are shown in Table 8, together with the value of traditional accuracy.


Table 8. Accuracies for the percentage variation model, using weights in Table 7

Traditional accuracy | Vector-based accuracy (Weights 1) | Vector-based accuracy (Weights 2)
32.70% | 66.10% | 62.60%


Matrix Vector-Based Approach


This solution is a generalization of the vector-based accuracy that starts directly from the confusion matrix of a model and takes into consideration each single error type. In particular, an N×N matrix of weights, mat_weights, is provided by the user, such that each mat_weights[i,j] represents the weight associated with the cases (promotions, in our context) where the true class is i and the predicted class is j. The new accuracy is then computed by combining this matrix of weights with the confusion matrix (mat_confusion):

$$\mathit{Accuracy}^{matrix} = \frac{\sum_{i} \sum_{j} \left( mat\_confusion[i,j] \cdot mat\_weights[i,j] \right)}{\sum_{i} \sum_{j} mat\_confusion[i,j]}$$
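The matrix-based measure can be sketched in the same way as the vector-based one. In the example below, rows index the true class and columns the predicted class, as in the definition above; the 3×3 confusion matrix, the weight values and the helper name `weights_from_vector` are assumptions made for illustration only.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def matrix_accuracy(mat_confusion: Matrix, mat_weights: Matrix) -> float:
    """Weighted sum of the confusion-matrix cells divided by the record count."""
    weighted = sum(c * w
                   for conf_row, weight_row in zip(mat_confusion, mat_weights)
                   for c, w in zip(conf_row, weight_row))
    total = sum(sum(row) for row in mat_confusion)
    return weighted / total


def weights_from_vector(weights: Sequence[float]) -> List[List[float]]:
    """Vector-based accuracy as a special case: the weight of cell (i, j)
    depends only on the distance |i - j|."""
    n = len(weights)
    return [[weights[abs(i - j)] for j in range(n)] for i in range(n)]


# Hypothetical confusion matrix over 3 classes (rows: true, columns: predicted).
mat_confusion = [[5, 1, 0],
                 [2, 4, 1],
                 [0, 3, 4]]

# Only exact matches count: equivalent to traditional accuracy.
exact_only = weights_from_vector([1.0, 0.0, 0.0])

# Asymmetric weights: overestimates (above the diagonal) are penalized less
# than underestimates (below the diagonal).
asymmetric = [[1.0, 0.5, 0.0],
              [0.25, 1.0, 0.5],
              [0.0, 0.25, 1.0]]

print(matrix_accuracy(mat_confusion, exact_only))
print(matrix_accuracy(mat_confusion, asymmetric))
```

The asymmetric matrix shows the extra expressiveness gained over the vector-based approach: two errors of the same size receive different weights depending on their direction.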


Example 2 


Table 9 shows a sample confusion matrix that generated the first distribution discussed in Example 1.

As we can see, the matrix has a predominance of values along the diagonal, and moreover there is a higher density right below the diagonal, meaning that the model tends to predict values lower than the real ones. We apply the approach with two slightly different matrices of weights, shown respectively in Table 10 and in Table 11.

The first matrix gives stronger penalties to errors of larger extent, since the weights decrease as the distance from the main diagonal increases. Moreover, the extreme cells of the matrix (lower left and upper right corners) contain negative values, since these cases of very large errors highly degrade the usability of a predictive model. The second matrix has a similar structure, but the records below the main diagonal are given stronger penalties, meaning that overestimates of the predicted class are preferred to underestimates. In the context of sales forecasting for stocking purposes, this corresponds to preferring to avoid OOS situations that might compromise the effectiveness of a promotion.

The matrix-based accuracy values corresponding to the two weight matrices presented above are summarized in Table 12. The results show that the second set of weights introduces a loss of accuracy, due to the fact that the classification model analyzed, as mentioned above, tends to predict values lower than the actual classes, which are particularly penalized by the second weight matrix.

Table 9. Sample confusion matrix

Table 10. Matrix of weights 1

Table 12. Accuracies for the sales prediction model, using weights in Table 10 and Table 11

Traditional accuracy | Matrix-based accuracy (Weights 1) | Matrix-based accuracy (Weights 2)
37.50% | 70.38% | [illegible]


DATA MINING FOR ‘OUT OF STOCK’ EVENT


Every time the number of items available in a shop is less than customer demand, an OOS event occurs, and the good is not on the shelf. Each OOS is lost revenue, sometimes of a significant amount. Moreover, it can be a source of customer discontent that could lead customers to abandon the shop.

This event is often connected to promotions and can have different causes. The most frequent cause is a wrong estimate of future sales, which results in stocking fewer items than customers need. Other causes are a delayed delivery from the general warehouse to the shop, or from the local warehouse to the shelves. For these reasons it is possible to define two different types of OOS: warehouse level and shop level. In the first case the warehouse cannot deliver items to the shop during the promotion days. This case can be treated only using stock data, which is not possible in our scenario. In the second case, the shelves are not restocked during a single day and an OOS can occur. It is also possible to have more than one OOS during the promotion period. This second scenario can be treated starting from sales data, and it is the analysis we propose. An ad hoc model is required.

Table 13. Out of stock scenario


Out of Stock Model Definition 


The model was made trying to capture all possible 
scenarios in which OOS could occur. We model 
this phenomenon at the shop granularity using 
a division of a single day into four time slots: 
morning, lunch, afternoon and evening. 

The model, as a first approximation, is directed towards the detection of abrupt declines in sales between two contiguous time slots. The percentage change between contiguous time slots is analyzed and, if the decline exceeds a fixed threshold (default -90%), we assume an OOS event took place (Condition 1). In the sample scenario of Table 13, an OOS occurs between Lunch and Afternoon.

Considering only the percentage changes between the time slots, however, this first formulation of the model does not capture two possible anomalies: the resumption of sales and the existence of products with very low sales.

(Condition 2) If there is a sharp fall in sales in an intermediate slot (as in Condition 1), but then sales increase again, it is clear that this is not an OOS event. To take this aspect into account, we need to verify that the sales in the time slot right after the fall stay below a threshold, in other words that there is no significant upturn in sales. In order to provide greater flexibility, this threshold value is calculated dynamically using the following formula:
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Table 14. Scenario with gradually decreasing sales 


$$\mathit{OutOfStock}^{threshold} = \min(\max(2, \mathit{criticalValue}), 10)$$

The critical value is the sales value that caused the OOS.

(Condition 3) The model with only the percentage changes (Condition 1) recognizes an OOS if a product sells 1 unit in all time slots except one in which it sells 0 (there is a percentage change of -100%). This is clearly a product that has very low sales, and therefore not a case of stock-out. To properly handle such situations we introduce a new threshold that determines the minimum number of units sold which must precede an OOS (default value 5).

(Condition 4) Condition 3 could mask some cases. In the scenario of Table 14 there is a strong percentage decrease between lunch and afternoon which satisfies Condition 1, but it would not be regarded as an OOS because it does not satisfy Condition 3. In these cases the percentage decline is gradual and involves several time slots. To consider this case too, a second threshold of percentage variation is introduced in the model, set to a lower value (in magnitude) than that of Condition 1 (default -70%).

Now, when there is a high percentage change that satisfies Condition 1 but not Condition 3, the lower threshold is used to identify declines in sales spanning several time slots. Choices and parameters for the thresholds were validated by a Coop marketing manager.
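The four conditions can be combined into a single check over the time slots of one day. The sketch below is one possible reading of the description, not the implementation used in the study: the function name `detect_oos` is hypothetical, the critical value is assumed to be the sales in the slot where the fall is detected, and Condition 4 is assumed to look back exactly one additional slot for a gradual decline.

```python
from typing import Optional, Sequence


def detect_oos(sales: Sequence[int],
               drop_pct: float = -90.0,
               gradual_pct: float = -70.0,
               min_units: int = 5) -> Optional[int]:
    """Return the index of the slot where an OOS is assumed to occur, or None.

    sales holds the units sold in consecutive time slots of one day
    (e.g. morning, lunch, afternoon, evening).
    """
    for k in range(1, len(sales)):
        prev, cur = sales[k - 1], sales[k]
        if prev <= 0:
            continue  # no percentage change can be computed
        change = 100.0 * (cur - prev) / prev
        if change > drop_pct:
            continue  # Condition 1 fails: the fall is not abrupt enough
        if prev < min_units:
            # Condition 3 fails. Condition 4 (assumed reading): accept the
            # event only if the previous slot already showed a strong decline
            # starting from a sufficiently high sales level.
            if k < 2 or sales[k - 2] < min_units:
                continue
            earlier = sales[k - 2]
            if 100.0 * (prev - earlier) / earlier > gradual_pct:
                continue
        # Condition 2: sales must not resume in the slot right after the fall.
        threshold = min(max(2, cur), 10)
        if k + 1 < len(sales) and sales[k + 1] >= threshold:
            continue
        return k
    return None


# Illustrative daily profiles (morning, lunch, afternoon, evening).
print(detect_oos([20, 30, 2, 0]))   # abrupt fall between lunch and afternoon
print(detect_oos([20, 30, 1, 25]))  # sales resume in the evening: no OOS
print(detect_oos([1, 1, 0, 1]))     # very low seller: no OOS
print(detect_oos([20, 4, 0, 0]))    # gradual decline over several slots
```

All default values are the ones stated above (-90%, -70%, and a minimum of 5 units); the sample sales profiles are invented for the example.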


Model Construction 


Data analysis of the food sector in the super stores produces the following distribution for the number of OOS events, calculated as previously defined (see Figure 9).

The OOS event occurs in 44% of cases. Moreover, for half of these cases it happens more than once. Considering that a promotion lasts 15 days and that the OOS is a daily event, these numbers are surprising. Because of the strong unbalance of the distribution, it is difficult to build a classifier using the number of OOS events in the promotion as the objective function. In this case too, a discretization is necessary to allow the classifier to achieve good results. Two discretizations of the variable representing the number of OOS events are possible: (1) in three classes (zero vs. one vs. more than one) or (2) in two classes (zero vs. at least one). The class distribution for both cases is provided in Figure 10.

The first solution is certainly preferable for the significance of the forecast: we have three different values that identify the degree of risk of an OOS event. The second solution, on the other hand, has a more balanced distribution (56% - 44%) and therefore is more suitable for classification.

As regards both the division of records between training set and test set and the choice of predictors, the same considerations outlined in the previous section for the construction of models for forecasting sales volumes apply. The main parameters chosen for the construction of this model are similar to those used for the sales forecast: pruning severity = 45, minimum number of cases per leaf = 3, no boosting, and output model in the form of rules. The accuracy is 75.14% on the training set and 72.61% on the test set.
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Figure 9. Distribution of out of stocks in the Super stores 
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Generalizing the analysis of data for the food 
sector in the hyper stores, the resulting distribution is very similar to the previous one. OOS events occur in 65.9% of cases and their number decreases with the same trend, even if more quickly. In this case too, the discretization was binary. The parameters used for the construction were the same as in the previous case, except for the pruning severity, set to 75 instead of 45. The accuracy is 71.65% on the training set and 67.58% on the test set. This decrease in accuracy is an expected result, given the greater variety of goods and promotions taking place in hypermarket stores.

Creating the classifier as a set of association
rules makes it possible to verify the existence of
some interesting phenomena. The following table
shows some general rules that fit well; support
and confidence values are calculated on the data
of all stores.


Table 15. Classification rules for out of stock prediction

Rule                                          Support       Confidence

1. if PRES_MKT = LEADER
   and VEND_SEG_1_0 > 479
   and CATEGORIA = CAFFE’
   then class = 1                             360           8?%

2. if FL_VOLANTINO = Si
   and CATEGORIA = ACQUE
   and VEND_SEG_1_0 <= 78583
   then class = 1                             219           [illegible]

3. if FL_VOLANTINO = Si
   and CATEGORIA = ALIMENTI INFANZIA
   and VEND_ART_3_1 > 142
   and VEND_ART_1_0 > 96
   then class = 1                             677           65%

4. if FL_VOLANTINO = Si
   and CATEGORIA = ALIMENTI PER GATTI
   and VEND_SEG_1_0 > 4369
   then class = 1                             [illegible]   76%

5. if MESE = 12
   and FL_VOLANTINO = Si
   then class = 1                             3671          64%

6. if MESE = 12
   and CATEGORIA = GELATI
   then class = 0                             [illegible]   95%

7. if RILEVANZA = IGIENE
   and CATEGORIA = ALIMENTI INFANZIA
   and VEND_ART_1_0 <= 45
   then class = 0                             [illegible]   [illegible]

([illegible] and ? mark values that cannot be read in the source.)


• Rule 3 states that promotions involving food
for children that were advertised in the leaflets,
where the product sold more than 96 pieces in the
last month and more than 142 in the last three
months before the promotion, are likely to go
OOS. We can notice that when the conditions of
this rule are satisfied, those of Rule 2 of
Table 6 are satisfied as well, in which case we
can expect the sales prediction of the latter rule
to underestimate the real stocking need for the
promotion.

• Rule 1 states that promoted coffee of a
leader brand that sold more than 479 pieces
in the last month will most likely go OOS.
Therefore, the estimates provided by the
sales forecast models should generally be
interpreted as underestimates.

• Rule 6 states that ice creams promoted during
December almost never (95% confidence) go OOS,
which was expected since the consumption of and
demand for this kind of food are usually very
limited during the winter season.
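As a minimal sketch of how the two measures could be computed for one rule, the code below assumes that support is the number of records matching the rule conditions and that confidence is the share of those records having the predicted class. The records are invented for illustration; only the attribute names MESE and FL_VOLANTINO come from Table 15.

```python
def rule_stats(records, condition, predicted_class):
    """Return (support, confidence) of a rule over a list of records.

    support    = number of records satisfying the rule conditions
    confidence = fraction of those records whose class matches the rule
    """
    covered = [r for r in records if condition(r)]
    support = len(covered)
    if support == 0:
        return 0, None
    correct = sum(1 for r in covered if r["class"] == predicted_class)
    return support, correct / support


# Invented records; class 1 means the promotion went out of stock.
records = [
    {"MESE": 12, "FL_VOLANTINO": "Si", "class": 1},
    {"MESE": 12, "FL_VOLANTINO": "Si", "class": 1},
    {"MESE": 12, "FL_VOLANTINO": "Si", "class": 0},
    {"MESE": 12, "FL_VOLANTINO": "No", "class": 0},
    {"MESE": 7, "FL_VOLANTINO": "Si", "class": 1},
]


def december_leaflet(record):
    """Conditions of the rule: MESE = 12 and FL_VOLANTINO = Si."""
    return record["MESE"] == 12 and record["FL_VOLANTINO"] == "Si"


support, confidence = rule_stats(records, december_leaflet, 1)
print(support, confidence)
```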


DEPLOYMENT AND 
FUTURE TRENDS 


The project ends with a first deployment of the
system for end users, generally marketing staff
managers. The application offers them a dedicated
area for analysing sales forecasts and for
consulting the trend of historic promotions.

Users can connect to a prevision web page,
shown in Figure 11 through an example, in which
they can query the system about a sales forecast.
They can choose the good to promote, the store,
the starting date, the duration and the mechanics
of the promotion, and then launch the prediction
step (bottom right button in Figure 11).

Figure 11. Prevision Web page

The sample parameters in our example are the
following:


• Good = yogurt [product id 18384]
• Store = Viterbo (near Rome) [id store 40]
• Starting date = 2008, August, 1
• Duration of promotion = 15 days
• Promotion mechanics = 20% discount


At the end of the computation, results are
shown in a sales trend graph, as in Figure 12.
The time period analyzed covers several months
preceding the promotion. The sales forecast is
expressed both as a number of items and as a
percentage change range over the previous period.
The system also gives an estimate of possible OOS
in the promotion time period.

The example output shows that the promoted
yogurt will sell from 240 to 360 items (sales
forecasting model), or that its sales will
increase by 100% to 200% (percentage variation
model). Moreover, the item is predicted positive
for OOS. This additional information suggests
that the forecast is an underestimate.
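The relation between the percentage forecast, the item range and the OOS flag can be sketched as follows. The baseline of 120 items is an assumption, chosen because it makes a +100% to +200% variation correspond to the 240 to 360 items of the example; the function names are hypothetical.

```python
def range_from_variation(baseline, low_pct, high_pct):
    """Convert a percentage increase range into an item range."""
    return (baseline * (1 + low_pct / 100), baseline * (1 + high_pct / 100))


def describe_forecast(item_range, oos_predicted):
    """Phrase the forecast, flagging it as a lower bound when OOS is predicted."""
    low, high = item_range
    text = f"{low:.0f} to {high:.0f} items"
    if oos_predicted:
        text += " (OOS predicted: treat as a likely underestimate)"
    return text


# Assumed baseline sales in the comparison period: 120 items.
forecast = range_from_variation(120, 100, 200)
message = describe_forecast(forecast, oos_predicted=True)
print(message)
```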

Users can also access a statistical trend page
in which they choose a store, a time period and a
good. The output is the sales trend of the good
per month (in blue in Figure 13). Using the
compare button, a new trend is drawn (in red, on
the right side of the figure), representing the
average sales of other goods in the same
marketing segment.

In this way, marketing end users can
conveniently browse and analyse the promotion
data warehouse.


Evaluation and Future Trends 


Although an exhaustive analysis of the economic
impact of the proposed approach is difficult and
requires ad hoc investigation, a rough estimate
can easily be computed by exploiting Gruen’s
work (Gruen, 2002). The latter established that
in a retail context, the direct loss caused by OOS
amounts to around 4% of the potential sales of
promoted articles. On the other hand, our
predictive models estimate that around 50% of
promotions result in an increase of sales to over
500% of the normal amount, i.e., the amount of
sales obtained when the article is not under
promotion. That means that our model can
potentially avoid an economic loss estimated as:
promotions evaluated × percentage of sales loss =
50% × (500% × 4%) = 10%, i.e. correct stocking can
help avoid around 10% of loss due to OOS.
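The arithmetic of this rough estimate can be reproduced in a few lines. The function and parameter names are illustrative; the three input values are the ones quoted above.

```python
def avoidable_loss(share_of_promotions, sales_multiplier, oos_loss_rate):
    """Rough estimate: share of promotions x (sales multiplier x OOS loss rate)."""
    return share_of_promotions * (sales_multiplier * oos_loss_rate)


# 50% of promotions, sales at 500% of the normal amount, 4% direct loss from OOS.
estimate = avoidable_loss(0.50, 5.0, 0.04)
print(f"{estimate:.0%}")
```

Changing any single input scales the estimate linearly, which makes clear how sensitive the 10% figure is to the 4% loss rate taken from the literature.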

The methodological and empirical results ob- 
tained in this work provide the basis for several 
improvements and refinements of the approach 
adopted. Among them, we mention the following: 


Figure 12. Web page — forecast prevision 




• The two models proposed, one for sales
forecasting and the other for OOS prediction,
might be integrated in order to obtain a more
sophisticated sales forecasting model that does
not suffer from under-estimation problems.

• The kind of analysis and models produced so
far might be extended and adapted to the case of
articles that are not directly involved in
promotions, comparing the results with those
obtained by Papakiriakopoulos (2009).

• Similarly, the behaviours of promoted and
non-promoted articles might be analysed in a
comparative way, thus making it possible to
understand both effects and side effects of
promotions.

• A deeper analysis might be performed to check
the predictive value of the variables that the
models selected as important. The goal is to
reveal potential unexpected dependencies between
sales/OOS and some “less credited” attributes.

• Finally, more sophisticated methods for
estimating the economic impact of the whole
approach, as well as empirical measurements,
might be developed and performed, in order to
better validate the methodology proposed.


CONCLUSION 


The work described in this chapter is part of the
BI-Coop project conducted by the KDD Laboratory
at ISTI-CNR (Pisa, Italy) in collaboration with
Unicoop Tirreno and concerns large-scale retail.
The goal was to analyze the trends in sales of
articles offered for promotion in order to improve
the quality of stocking of such products. As such,
this work provides an example of a methodological
approach, technical solutions and empirical
evaluations of data mining methods at work in
the context of management/decision making for
a private organization.

Our contribution is twofold. On the project/
application side, for the prediction of sales
(both in absolute value and in percentage value)
and for the definition, analysis and prediction
of out of stocks, we outline a methodology that
allows a switch from business intelligence to
data mining. On the research side, we defined a
methodology for the qualitative and quantitative
analysis of multi-class classifiers with ordinal
classes. This seems to be essentially an open
problem: the literature does not offer convincing
and consolidated solutions, although such cases
are particularly frequent in the analysis of
social phenomena, in which discretization of
continuous values is often applied.

Despite the difficulty of the problem in terms
of quantity and quality of data, excellent results
were achieved regarding both the prevision of
promotional sales and the analysis of the OOS
phenomenon. Besides the positive results, several
lines of development clearly emerged, especially
concerning the analysis and quantitative
evaluation of the economic impact that our
approach can have in practice. While preliminary
basic estimates are encouraging, more precise
solutions to the problem are needed for an
effective usability of the overall solution within
organizations. On the data mining side, several
technical issues emerged, including the
integration of sales forecasting and OOS
prediction in a single predictive model, and
considering the correlations that may exist between
the various products in promotion and the others
not included in a what-if scenario.


KEY TERMS AND DEFINITIONS


Classification Model: Classification is a 
procedure in which individual items are placed 
into groups based on quantitative information on 
one or more characteristics inherent in the items 
(referred to as traits, variables, characters, etc) and 
based on a training set of previously labeled items. 

Classification Rule: Classification Rule is a 
popular and well researched method for discov- 
ering interesting relations between variables in 
large databases. 




Data Mining: Data mining is the process of 
extracting hidden patterns from data. Data mining 
identifies trends within data that go beyond simple 
data analysis, through the use of sophisticated 
algorithms. 

Discretization: Discretization concerns the 
process of transferring continuous models and 
equations into discrete counterparts. 

Multi-Class Classification: Multi-class
Classification is a kind of Classification in
which there are strictly more than two groups.

Forecast Analysis for Sales in Large-Scale Retail Trade


Out of Stock: Out of Stock is the event in
which a store runs out of a product to sell
before the promotion is finished.

Sales Forecasting: Forecasting is the process
of estimation in unknown situations. Sales
Forecasting is used in the practice of Customer
Demand Planning in everyday business forecasting
for manufacturing companies.


Chapter 13


Preparing for New Competition 
in the Retail Industry 


Goran Klepac 
Raiffeisen Bank Austria, Croatia 


ABSTRACT 


A business case presents a retail company facing new competitors and consequently preparing a customer
retention strategy. The business environment in which the company was operating prior to the arrival of
new competitors can be described as a stable market. Bearing in mind the plans and marketing activities
of a competitor retail chain and making use of data mining methods, a system is devised for the
purpose of preventing or at least buffering the churn trend. The development of an early warning indicator
system based on data mining methods is also described, as a support to management in the early
detection of both market opportunities and threats. Research in data mining could also concentrate
on applying existing data mining techniques to find the best solution to practical business problems
in the public or private sector. Knowledge of how some business cases were solved using
data mining techniques could contribute to a better understanding of the nature of data mining
and help solve specific business issues.


INTRODUCTION 


In a turbulent business environment with fierce
competition, getting new customers on board and
selling a product or a service is more than a
challenge (Berry 1997, Berry 2000, Giudici 2003).
Companies usually resort to intense campaigns,
which along with campaign costs sometimes include
price reductions below the current market prices
and/or quality improvements (Klepac 2006).

Once the sales goals have been achieved and
marketing and product development costs covered,
a new battle geared towards keeping existing users
of products and services starts. Reasons for
customer churn are diverse. They range from the
unexpected moves of competitors trying to gain a
bigger piece of market share by using swift
campaigns (possibly directly endangering your
company’s market position) to unsatisfied clients
suddenly starting to churn (Berry 2000). High
initial costs of new product development as well
as campaign expenses are seen as investments with
the aim of gaining a bigger market share. In some
cases, the success can be measured by the
time-span of usage of the advertised product by
the customer, which eventually decreases the unit
investment cost per product (Berry 2000).


BACKGROUND 


There are numerous case studies on how to
recognise and reduce churn in retail business
(Berry 2000, Berry 2003, Giudici 2003). Data
mining methods of use for this purpose include
logistic regression (Larose 2005, Larose 2006),
survival models (Berry 2003), neural networks
(Alexander 1995), and self-organizing maps
(Kohonen 2001).

If we review case studies for churn recognition
(Berry 2000, Berry 2003, Giudici 2003, Namid
2004), there are numerous different solutions
depending on business environment, market
conditions, industry, dominant customer segments
etc. Sometimes the source of data or data
organization has an influence on the final data
mining solution (Agosta 2000). The business case
presented in this chapter gives a potential
solution for a wholesale and retail business in
the situation where incoming competitors enter a
relatively peaceful market environment. For such
a case, the chapter provides a solution applied
in practice which demonstrated good results after
implementation.


How to Fight Competition 


Trgovina is a wholesale and retail business, 
owning approximately thirty retail stores (su- 
permarkets) across Croatia, as well as three 
wholesale stores whose main purpose is to 
provide goods to the retail stores, but also to 
sell goods to other legal persons in the retail
and wholesale business.

Trgovina deals in consumer goods and owns
a central transactional database (Oracle) system
integrating all stores. The main transactional data
stored in the information system consist of data
sets from each filled out invoice related to codes of
goods sold (categorised in the system), quantities
bought, time of issuing the invoice and all other
legally relevant elements necessary for creating
a turnover book. Two years ago, the company
started a loyalty card program. Within the
program, customers collected points and, having
collected a certain number of points, they earned
the right to buy products at discount prices;
periodically they were also awarded gifts
depending on the number of points collected.

The customers were obliged to provide the
following data in the loyalty card application
form:

• Name and surname
• Full address
• Year of birth
• Family status (married or not)
• Number of children
• Education level
• Categorised hobbies (sport, arts, etc.).
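One possible way to hold the application form data is sketched below. The class name, field names and sample values are hypothetical; the chapter only lists which data were collected, not how they were stored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoyaltyApplication:
    """Hypothetical record for one loyalty card application form."""
    name: str
    surname: str
    full_address: str
    year_of_birth: int
    married: bool
    number_of_children: int
    education_level: str
    hobbies: List[str] = field(default_factory=list)  # categorised, e.g. "sport"

    def age_in(self, year: int) -> int:
        """Approximate age in a given calendar year."""
        return year - self.year_of_birth


# Invented sample applicant, for illustration only.
applicant = LoyaltyApplication(
    name="Ana",
    surname="Horvat",
    full_address="Example street 1, Zagreb",
    year_of_birth=1975,
    married=True,
    number_of_children=2,
    education_level="secondary",
    hobbies=["sport"],
)
print(applicant.age_in(2008))
```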


As a response to one competitor’s plans to
increase the number of its retail stores in a
substantial number of regions where Trgovina’s
retail stores are present, the company has to
increase the loyalty of existing customers and
acquire new ones, with the aim of keeping them
even when competition starts operating in the
neighbourhood of some of the existing Trgovina
retail stores.

The company is well aware of the fact that
the initial strategy of its competitors to acquire
business in a certain region involves attracting as
many customers as possible and ensuring their
long-term loyalty. It is expected that, in order to
achieve this goal, the competition will be willing
to invest a certain amount of money, resulting in
very low and competitive prices, other benefits
and a strategy aimed at targeted market segments.
Based on previous analyses, Trgovina has
segmented its market and performed an analysis
of typical profiles and behaviour models for each
market segment at the regional level.

Since the onset of fierce market competition is
inevitable, during which the competition will try
to acquire and keep as many customers of other
retail stores as possible, it is necessary to devise a
strategy for keeping as many existing customers as
possible in the given circumstances. It is obvious
that Trgovina will have certain costs related to
accomplishing this goal. In all probability, a
number of customers will defect to the competition.
It is of vital importance to distinguish which
customers will go to the competition, i.e. it is
crucial that the resources spent on retention are
invested in keeping the most important customers.
The competition probably counts on the
attractiveness factor which, multiplied by
curiosity, is certainly on their side. Regardless
of the loyalty level, it is very probable that a
great number of customers loyal to Trgovina will
make a certain number of purchases in the
competitor’s store, but the aim is to bring back
as many customers as possible. This will probably
significantly decrease Trgovina’s income during
the estimated maximum period of two months in
which competitors stake out their ground in a new
region. After the two months, a period of
stabilisation is expected, mostly due to the
planned analysis.

An additional latent danger present in the
described situation is the threat of potential
long-term loss of customers; in this kind of new
market situation there is a continued danger of
customer attrition. Such loss may be caused by
well positioned competitors’ campaigns geared
towards acquiring customers from the other
retailers. On the other hand, the loss may be
caused by an increase in the number of existing
customers who are dissatisfied with the service
provided or with extreme differences in prices of
some or most of the products offered on the
shelves.


For this reason, it is necessary to develop
a monitoring system primarily responsible for
recording customer churn or stagnation of sales
amount. Furthermore, it is necessary to make an
estimated profile of a typical customer prone to
buying Trgovina goods and to conduct the analysis
prior to the opening of the competitor’s store as
well as two months after the competitor’s store
has been opened. Given the comparatively stable
situation in the market over the last five years,
the estimation of churn based on data from the
previous period may be questionable, especially
since in most of the regions where Trgovina
stores are located no other large stores opened,
with the exception of small shops which posed
no competitive danger. In order to make a precise
estimation of churn probability, it is possible to
conduct classical market research on a sample
of Trgovina’s customers with the purpose of
getting a clearer picture of possible churn after
the opening of a competitor store.


GOALS OF THE ANALYSIS 


The goals of the analysis have to be in line with 
the strategic goals. The main strategic goal is 
to retain existing customers, primarily the high 
quality ones. In accordance with this objective, 
certain activities will be planned with the purpose 
of motivating all customers (especially the highest 
quality customers) to increase their loyalty level 
and continue buying in the Trgovina store. A fur- 
ther strategic goal, observed from the defensive 
point of view, is to monitor the customers with 
the purpose of preventing long-term churn trends. 
These goals are common for data mining projects
(Dresner 2008, Berry 2000, Berry 1997, Faulkner
2003).

In accordance with the strategic goals, the
first step of the analysis should encompass the
evaluation of customers and their classification
in a number of categories, in order to estimate
the importance of each customer to the company by
segment. It was decided that only customers with
loyalty cards will be included in the evaluation
model, because they comprise 85% of all customers.
Customers having no loyalty card are impossible
to identify based on past transactions only.
The obtained results shall yield the customer
structure classified on the basis of their importance
for the company. Taking the results into account,
the following steps will consider further
analytical procedures with the aim of increasing
customer loyalty and monitoring the customers
during the period of entry of competition into
the region with the purpose of maintaining the
market share. The monitoring should serve as a
decision support system for the decision makers
in conditions when the competition exercises a
more aggressive market approach.

Since retail stores are dispersed throughout
Croatia, the analyses shall encompass
microsegments, i.e. the market shall be analysed
from the regional perspective. The idea is to keep
as many existing customers as possible, especially
those who are of vital importance for the company,
to acquire new customers if possible, and to
prepare for the first wave of competitive
campaigns using customer-related monitoring
systems with the possibility of ad hoc analyses in
case of massive customer churn that would serve as
a means of preventing those trends. Once the
market has stabilised, the aims of analysis may
have a proactive structure and be aimed at
acquiring competitors’ customers.


DEVELOPMENT OF THE 
CONCEPTUAL SOLUTION MODEL 


In the literature there are numerous solutions
regarding churn analysis (Berry 2000, Berry 1997,
Namid 2004). The presented case is specific
because of the expected churn trends, and because
it describes a different approach to solving the
problem.

First we need to recognize customers with
high perceived customer value, and after that we
should develop a churn monitoring system. The
conceptual solution model consists of two basic
segments. The first segment should evaluate
customers on a regional basis using the scoring
model. Customers from each region should be
evaluated in several categories based on expert
knowledge. In this manner better insight into
customer behaviour based on their importance to
the company should be gained. Since the goal of
the company is to retain as many clients as
possible (especially the most important ones), a
strategy shall be devised based on the results of
the analysis with the purpose of increasing
customer loyalty.

The competition will inevitably attract a certain
number of customers, so the second segment of the
solution model will focus on monitoring customer
churn or stagnation of sales in Trgovina retail
stores. Given the stable market conditions in the
past and the almost negligible percentage of
customer churn, which makes the analysis more
difficult, classical market research should be
performed in the beginning with the aid of a
market research agency. This research should
provide answers to questions such as which market
segments of existing customers of Trgovina stores
on a regional basis would be prone to start buying
from the competition and under which conditions,
what would be their prime reasons for continuous
buying in competitor’s stores, and what would
motivate them to continue buying from the
Trgovina retail stores.

Table 1. A proposition of table for monitoring the promotional activities

Date of the campaign
10.06.2006
10.07.2006


All the aforementioned information could serve
as a basis for devising a strategy for successful
resistance during the period of the competitor’s
intense and aggressive advertising campaigns at
the time of their entering the market, and to
their attempts to win over the customers. Since
it is almost certain that the competition shall
constantly and suddenly undertake advertising
activities with the aim of taking over a number
of customers, a system of permanent monitoring
should be developed with the purpose of early
diagnosis of churn trends in certain market
segments, with the possibility of analysing their
dominant characteristics. To this aim, the
so-called survival models would be implemented.
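The chapter does not specify which survival model is meant. As an illustration only, the sketch below uses the Kaplan-Meier estimator, a common choice, to estimate the share of customers still active after a given number of weeks. The durations and churn flags are invented; a flag of False marks a customer who was still active when observation ended (a censored record).

```python
def kaplan_meier(durations, churned):
    """Return [(time, survival_probability)] for each time with a churn event."""
    events = sorted(zip(durations, churned))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        churn_count = 0
        leaving = 0
        # Group all customers whose observation ends at the same time t.
        while i < len(events) and events[i][0] == t:
            if events[i][1]:
                churn_count += 1
            leaving += 1
            i += 1
        if churn_count:
            survival *= 1 - churn_count / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Invented observations: weeks since the competitor store opened.
weeks = [2, 3, 3, 5, 8, 8]
has_churned = [True, True, False, True, False, False]
curve = kaplan_meier(weeks, has_churned)
for week, share in curve:
    print(week, round(share, 3))
```

Computing such curves separately per market segment would show in which segments churn sets in earliest, which is the kind of early diagnosis the monitoring system is meant to support.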

Having in mind that each future aggressive 
advertising campaign of the competition poses 
a threat of losing a number of customers, one 
should also note that this presents an opportunity for
conducting additional analyses which would help 
the Trgovina company to recognise the behaviour 
of its customers and promptly react to reduce 
the consequences of future market actions of the 
competition. For example, if a sudden competi- 
tor’s advertising campaign manages to win over a 
portion of customers having a common attribute, 
it is an outcome which Trgovina may use to its 
advantage in order to aim its own campaign to- 
wards winning those customers back (with respect 
to motivating factors of that market segment) and, 
if possible, attract competitor’s customers. For 
the purposes of future analyses, it is possible to 
store the data related to competitor’s campaigns 
and their main attributes, as shown in Table 1. 

Table 1 (continued): Tools, 14 days


If the data from such tables are paired with the
results obtained using the survival models, which
can provide us with common attributes of
customers who churned or reduced their purchases,
it is possible to obtain additional information
related to the character of the competitor’s
campaigns. Based on the accumulated knowledge of
possible motivators of a certain market segment,
Trgovina may plan its own campaigns at a regional
level. For instance, if a decrease is observed in
purchases made by male clients (data known from
loyalty cards) who are regular customers with
relatively high turnover per invoice in previous
transactions, and if that stagnation of sales
amount correlates with the period of a
competitor’s promotion of a new assortment of
tools, this data can be very useful for planning
a campaign aimed at winning those customers back
and attracting new ones.
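A minimal sketch of such a comparison is given below. It assumes that the campaign dated 10.06.2006 in Table 1 lasted 14 days (the pairing of date and duration is an assumption, as is reading the date as 10 June 2006) and compares the turnover of one customer segment in the 14 days before the campaign with the turnover during it. All purchase amounts are invented.

```python
from datetime import date, timedelta


def turnover_in(purchases, start, end):
    """Sum the amounts of purchases dated within [start, end)."""
    return sum(amount for day, amount in purchases if start <= day < end)


def campaign_effect(purchases, campaign_start, duration_days):
    """Compare segment turnover before and during a competitor campaign."""
    window = timedelta(days=duration_days)
    before = turnover_in(purchases, campaign_start - window, campaign_start)
    during = turnover_in(purchases, campaign_start, campaign_start + window)
    change = (during - before) / before if before else None
    return before, during, change


# Invented purchases (date, amount) for one segment of loyalty card holders.
segment_purchases = [
    (date(2006, 5, 20), 500.0),  # outside both windows
    (date(2006, 5, 29), 400.0),
    (date(2006, 6, 3), 350.0),
    (date(2006, 6, 8), 250.0),
    (date(2006, 6, 12), 300.0),
    (date(2006, 6, 20), 300.0),
]
before, during, change = campaign_effect(segment_purchases, date(2006, 6, 10), 14)
print(before, during, change)
```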

First of all, it is important to analyse the new
tool selection. Is buying those tools a matter of
prestige? Are the tools sold at a very good price
but without guaranteed quality? Or are these high
quality tools at a good price whose purchase is
not a matter of prestige? Taking into account all
the variables, it is possible to launch a campaign
with the goal of winning back churned customers.
One must always bear in mind that it is much
cheaper to prevent situations of possible customer
churn than to try and win customers back. Given
the fierce market competition, it is advisable to
develop an action plan for such situations as
well.

Conceptual solution models are idea generators
and mark the beginning of any serious analysis.
They also serve as a recapitulation of goals,
primarily from the perspective of available
methodology and data. Conceptual solution models
mostly do not represent the final model solutions
which may be considered a finished project task,
because the flow of analysis itself and the
obtained results guide further analytical
processes, while the goals of the analysis remain
clearly defined. Having developed the conceptual
solution model, in which the methodology of
conducting the analysis is clearly profiled, the
following step is to select software for
performing the needed analyses and creating the
analytical models.


THE DEVELOPMENT OF
SCORING MODEL


There are two dominant planned types of analysis 
in this case—the scoring analysis and the customer 
churn analysis. The scoring model may be devel- 
oped with the aid of fuzzy expert systems, based 
on the rules resulting from expert knowledge. 
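To illustrate what a rule-based fuzzy scoring step might look like, the sketch below scores loyalty from two indicators. The indicators, membership breakpoints, rule outputs and the weighted-average defuzzification are all assumptions made for the example; the chapter does not give the actual rules used for Trgovina.

```python
def ramp_up(x, a, b):
    """Membership rising linearly from 0 at a to 1 at b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def loyalty_score(visits_per_month, turnover_per_invoice):
    """Score loyalty on a 0-100 scale from two hypothetical indicators."""
    frequent = ramp_up(visits_per_month, 2, 8)
    rare = 1.0 - frequent
    high = ramp_up(turnover_per_invoice, 100, 300)
    low = 1.0 - high
    # Each rule: (firing strength using min as fuzzy AND, output score).
    rules = [
        (min(frequent, high), 100.0),  # frequent AND high turnover
        (min(frequent, low), 60.0),    # frequent AND low turnover
        (min(rare, high), 40.0),       # rare AND high turnover
        (min(rare, low), 0.0),         # rare AND low turnover
    ]
    total = sum(strength for strength, _ in rules)
    # Weighted average of rule outputs (a simple defuzzification choice).
    return sum(strength * score for strength, score in rules) / total


print(loyalty_score(8, 300))
print(loyalty_score(5, 200))
print(loyalty_score(2, 100))
```

In a real project the rule table would come from the business experts, which is why the interviews described next matter so much for the quality of the model.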

Since the goal of the first step of the analysis is
the evaluation of customers based on the scoring
model at the level of each region, it is necessary
to define the possible sub-goals of analysis which
are defined, in this business case, as measuring
the level of client loyalty. In the development of
scoring models based on fuzzy logic, it is crucial
to involve business experts as team members who
are active in the development (Aracil 2000,
Pedrycz 1998, Siler 2005). The work of the team
is coordinated by an analyst or consultant who is
at the same time in charge of the entire project.
The Trgovina team consisted of a consultant, the
sales executive and his assistant, the marketing
executive and the chief information system
architect.

The consultant’s task was to coordinate and 
lead the team and to develop the scoring model. 
The sales executive and his assistant worked with 
the marketing executive and they were in charge 
of creating business rules and defining the key 
indicators for scoring (with the help of the con- 
sultant). The chief information system architect of 
Trgovina gave suggestions and opinions related to 
the existence of the data in the database based on 
the defined key indicators. He was also in charge 
of creating the documentation used for subsequent 
pre-processing of the data for the needs of the scor- 
ing model. During the final stage of the project, 
the chief information system architect worked 
with the consultant to create the solution for the 
integration of the scoring model into the existing 
information system. Alongside the mentioned 
team members, a certain number of programmers 
also worked on the project. They developed the 
ETL solutions for the scoring model based on 
the specification and took part in the operative 
segment of the integration of the scoring model
into the existing information system based on the
documentation created.


Interviewing the Users 


Successful interviewing of users is one of the 
crucial factors on which the overall success of 
scoring projects based on fuzzy expert systems 
depends. During the user interviews it is essential
to have in mind the desired goal of scoring. In the 
case of Trgovina, it is the evaluation of custom- 
ers in retail stores. The interviews are conducted 
by the consultant, or the person in charge of the 
development of the model itself. 

It is crucial to ask the questions that will be 
used to recognize the key indicators and categories 
relevant for scoring. Based on the interviews, we 
obtain a clearer picture of the users’ perception 
of a problem we want to solve, i.e. in this case to 
model it using the fuzzy expert systems. Some of 
the questions put to the Trgovina team were:


• Name at least three categories (e.g. promising,
profitable, loyal...) based on which you can
evaluate each of your customers.
• Is loyalty as a category relevant for the
evaluation of customers (scoring) in your
company?
• On which grounds (which procedures, behaviour
models) may loyalty of a customer be evaluated
within your company?
• Which indicators (e.g. turnover, difference
in turnover per invoice, campaign costs...)
are relevant for estimation of customer
profitability within your company?


The questions listed here are just an example 
of questions put to the auditorium during the 
brainstorming process, the goal of which was to 
recognize the key categories and indicators for 
building a scoring model. In this business case, as 
well as in all other business cases where scoring 
models are developed, the consultant (or person in 
charge of the development of the scoring model)
plays a key role in conducting the interview in the 
right manner. During the interview, this person 
obtains key information relevant for establishing 
a basic version of the model. 

The described business case is typical of projects
in which some team members come from the business
sector and others from the technical sector. Team
members from the business sector made their
suggestions without considering that some of the
proposed indicators would take a long time to
pre-process, given the volume of data and the
complexity of the algorithms involved. The presence
of the chief information system architect was very
useful here, since he suggested alternative
solutions. He also questioned suggestions from the
business side to use indicators that they
considered important but that were stored in
databases other than the main one. The consultant
played a prominent role in motivating the team
members to reach a compromise that was acceptable
from the perspective of model development and did
not diminish the plausibility of the model, while
respecting the technical limitations and
difficulties resulting from the database
architecture.


Defining Key Indicators 


The primary goal of the interviewing process was 
to define key indicators and basic categories as 
structural elements of a fuzzy expert system. Key 
indicators may be defined as basic variables, which 
provide input to a fuzzy expert system (Aracil, 
2000; Pedrycz, 1998; Siler, 2005). The categories
consist of more abstract notions defined using the 
key indicators. For example, key indicators are 
sales revenue, campaign costs, duration of busi- 
ness relationship. Based on these key indicators, 
a category of profitability is defined and limited 
with a set of rules. 


Based on business experience and with the help 
of the consultant, Trgovina recognised during the 
interview the three main categories which had an 
immediate influence on the client scoring: 


• Client profitability
• Client loyalty
• Client outlook


Each of the categories should be defined us- 
ing key indicators, developed on the grounds of 
available databases. These indicators are input 
parameters for a fuzzy expert system model. As 
a result of brainstorming during the interview, the
key indicators for each category (at the customer 
level) are defined as illustrated in Table 2. 

Having finished the brainstorming session, the
expert team agreed that the selected indicators
best describe the chosen categories. Since the
internal company experts know their customers
best, they rely on their expert knowledge to judge
which indicators in the databases best describe
the chosen categories, which are then used for
scoring. The consultant’s task during the
brainstorming (as was the case in this example) is
to guide the experts towards the most significant
indicators describing the chosen categories. During
this process, the number of key indicators defining
a category must be taken into account. Sometimes
this requires a further selection of the most
important indicators among the important ones, in
order to keep the number of indicators allocated to
a category under four or five. The reason for this
reduction is the limits of human perception: each
indicator becomes a condition in the body of a rule.

When more than five indicators appear in the rule
conditions (and each indicator may have a number
of subcategories), every rule carries more than
five conditions, in addition to the inevitable,
often large number of rules. Under such
circumstances, it is extremely hard to define the
rules. If more than five indicators must inevitably
be used, it is possible to introduce more
categories into the system. This simplifies the
process of defining the rules by the experts, and
the expert system itself becomes more transparent
and easier to survey.


Table 2. The definition of key indicators

Cross selling trends within the last two quarters

Client loyalty


Table 3. A process of key indicator calculation

Category: Client profitability

Key indicator: Total sales price difference in the
last six months
Calculation process: Market value − (purchase value
+ rebate + dependent costs), summed over all
invoices within the last six months for each
customer. Unit: Croatian kuna.

Key indicator: Total promotion expenses based on
the loyalty card program (gifts) within the last
six months
Calculation process: Total sum of distributed
promotional materials and gifts, obtained from the
loyalty card data for the last six months for each
customer. Unit: Croatian kuna.

Key indicator: Total expenses based on the
advertising campaigns within the last six months
Calculation process: Total sum of expenses for each
campaign executed during the last six months,
estimated for each customer (global campaigns on
the regional level promoted in the press and on
television + targeted campaigns in the form of
direct mailing to selected market segments
containing product samples). Unit: Croatian kuna.


Defining the Preprocessing 
Algorithms 


In most cases, the key indicators that serve as
inputs to the fuzzy expert system model have to be
derived from the available data, because they do
not exist in the transactional database in the form
required by the fuzzy expert system. For example,
to obtain the key indicator number of visits within
the last six months, which is used in the
estimation of customer loyalty to the company, an
algorithmic procedure had to be defined that
reconstructs the number of visits within the last
six months from the invoice numbers.
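
The reconstruction of the number of visits can be sketched as follows. This is only an illustration in Python (the project itself used PL/SQL), and the invoice layout, the function name visits_last_six_months and the rule that several invoices on the same day count as one visit are assumptions made here, not details taken from the Trgovina project.

```python
from collections import defaultdict
from datetime import date


def visits_last_six_months(invoices, as_of):
    """Count distinct visit days per customer in the six months before as_of.

    invoices: iterable of (invoice_no, customer_id, invoice_date) tuples.
    Assumption: several invoices on the same day count as a single visit.
    """
    # Step back six calendar months; clamp the day to 28 so the date is valid.
    month_index = as_of.year * 12 + (as_of.month - 1) - 6
    start = date(month_index // 12, month_index % 12 + 1, min(as_of.day, 28))

    visit_days = defaultdict(set)
    for _invoice_no, customer_id, invoice_date in invoices:
        if start < invoice_date <= as_of:
            visit_days[customer_id].add(invoice_date)
    return {customer: len(days) for customer, days in visit_days.items()}


# Hypothetical invoices, for illustration only.
invoices = [
    ("INV-001", "C1", date(2024, 5, 3)),
    ("INV-002", "C1", date(2024, 5, 3)),    # same day: still one visit
    ("INV-003", "C1", date(2024, 6, 10)),
    ("INV-004", "C2", date(2023, 11, 20)),  # outside the six-month window
    ("INV-005", "C2", date(2024, 4, 1)),
]
visit_counts = visits_last_six_months(invoices, as_of=date(2024, 6, 30))
print(visit_counts)
```

Counting distinct days rather than invoices is one possible design choice; a retailer that treats every invoice as a separate visit would simply count invoice numbers instead.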

After discovering the key indicators, the 
consultant and the chief information system 
architect created the documentation for the pre- 
processing of the data, encompassing all defined 
categories and their key indicators. Documenta- 
tion pertaining to the profitability category is 
shown in Table 3. 

Figure 1. Fuzzy scoring model

Using the documentation in such a form, the
chief information system architect developed 
detailed algorithms for each indicator. Based on 
these algorithms the team of programmers cre- 
ated the application for data pre-processing. The 
data pre-processing is much easier to perform 
when a data warehouse exists. In the described 
case, PL/SQL was used for the data pre-process- 
ing, which resulted in the creation of a table of 
key indicators for each customer. 
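
As an illustration of such a pre-processing step, the profitability indicator total sales price difference from Table 3 can be sketched in Python. The project used PL/SQL; the field names and sample values below are hypothetical and serve only to show the aggregation per customer.

```python
from collections import defaultdict


def total_sales_price_difference(invoice_lines):
    """Sum market value - (purchase value + rebate + dependent costs) per customer.

    invoice_lines: iterable of dicts, assumed to be already restricted
    to invoices from the last six months.
    """
    totals = defaultdict(float)
    for line in invoice_lines:
        difference = line["market_value"] - (
            line["purchase_value"] + line["rebate"] + line["dependent_costs"]
        )
        totals[line["customer_id"]] += difference
    return dict(totals)


# Hypothetical invoice data in Croatian kuna.
rows = [
    {"customer_id": "C1", "market_value": 100.0, "purchase_value": 60.0,
     "rebate": 5.0, "dependent_costs": 10.0},
    {"customer_id": "C1", "market_value": 50.0, "purchase_value": 30.0,
     "rebate": 0.0, "dependent_costs": 5.0},
    {"customer_id": "C2", "market_value": 80.0, "purchase_value": 70.0,
     "rebate": 4.0, "dependent_costs": 2.0},
]
indicator_table = total_sales_price_difference(rows)
print(indicator_table)  # {'C1': 40.0, 'C2': 4.0}
```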


Structural Model Development 


After defining the key indicators and the categories 
and subcategories during the interview, the basis 
of a fuzzy scoring model was defined. Having 
defined the categories, it was necessary to unite 
them in the form of a model depicted in Figure 1. 

The depicted model was developed using the
FuzzyTech program package. The key indicators in
the model are the input variables of the rule
blocks. Each rule block estimates the output value
of a category based on the defined rules. These
output category values enter the rule block for
final scoring, where the final score is estimated
on the basis of a defined set of rules. The rules
were defined by the experts, and the consultant
entered them into the fuzzy model. As can be
observed in the figure, the method used for
defuzzification is the Mean-of-Maximum (MoM)
method. It was chosen because the output results
obtained with this defuzzification method are very
easy to interpret, especially when such models are
integrated in the form of an applicative solution.
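
The idea behind Mean-of-Maximum defuzzification can be shown with a small sketch. It is a simplified version that assumes each output class is represented by the single crisp value at which its membership peaks; the class names and the 0 to 1 output scale are hypothetical, and the code is not a reproduction of the FuzzyTech implementation.

```python
def mean_of_maximum(activations, class_peaks):
    """Average the peak positions of the most strongly activated output classes.

    activations: {class_name: degree in [0, 1]} produced by a rule block.
    class_peaks: {class_name: crisp value where the class membership peaks}.
    """
    top = max(activations.values())
    winners = [name for name, degree in activations.items() if degree == top]
    return sum(class_peaks[name] for name in winners) / len(winners)


# Hypothetical output scale for a category such as loyalty.
peaks = {"low": 0.0, "medium": 0.5, "high": 1.0}

clear_winner = mean_of_maximum({"low": 0.1, "medium": 0.7, "high": 0.3}, peaks)
tie = mean_of_maximum({"low": 0.0, "medium": 0.6, "high": 0.6}, peaks)
print(clear_winner)  # 0.5
print(tie)           # 0.75
```

Because the result is driven only by the most plausible class (or classes), the output maps directly onto a named category, which is what makes the method easy to interpret.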


Defining the Key Indicator Range 


Classical logic, which allows only strict limits
between classes, is much further from human
perception mechanisms than fuzzy logic is. For
example, if we define the key indicator ‘number of
visits’ using classical logic, we would place a
small number of visits in the class of 0 to 20
visits during the last six months. Of course, we
can ask what happens with a customer who visited
the retail store 21 times during the last six
months. Under classical logic, he would be placed
in the next class. Human perception mechanisms
function on significantly different premises and
are more liberal when it comes to such
classification. The limit of a small number of
visits, observed through the eyes of human
perception mechanisms, is subject to tolerance and
deviation, which are often subjective in nature.

Figure 2. Defining the range of key indicator frequency of purchasing

In order to make categorization systems (or
scoring systems, as is the case here) as similar as
possible to the human process of making decisions,
and having in mind the limitations of classical
logic, fuzzy expert systems were used in the
development of scoring models for Trgovina. These
systems make it possible to define the range of key
indicators on the premise of fuzzy logic, which is
very close to the human way of thinking and
categorizing. Figure 2 illustrates how the range of
the key indicator frequency of purchasing within
the last six months was defined; the indicator is
calculated by counting the number of visits during
the last six months.

The frequency of purchasing is defined as low,
medium or high based on the input parameters. The
calculated values are further processed using the
defined expert rules. The expert team defined the
number of fuzzy classes for each key indicator, as
well as their names and ranges. These were then
entered in FuzzyTech, as shown for the key
indicator frequency of purchasing within the last
six months. The ranges of the output variables in
the model were defined in the same manner, and
together they became the basic elements for
forming the rules.
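
The difference from strict class limits can be illustrated with a short sketch of fuzzy classes for the frequency of purchasing. The trapezoidal shape and all breakpoints below are hypothetical; the actual ranges were set by the expert team in FuzzyTech and are not reproduced here.

```python
def trapezoid(x, a, b, c, d):
    """Membership rising on [a, b], equal to 1 on [b, c], falling on [c, d]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints (visits in the last six months).
FREQUENCY_CLASSES = {
    "low": (0, 0, 15, 25),
    "medium": (15, 25, 35, 45),
    "high": (35, 45, 200, 200),
}


def fuzzify_frequency(visits):
    """Return the degree of membership of a visit count in each fuzzy class."""
    return {
        name: trapezoid(visits, *params)
        for name, params in FREQUENCY_CLASSES.items()
    }


memberships = fuzzify_frequency(21)
print(memberships)
```

With these assumed breakpoints, a customer with 21 visits belongs partly to the low class and partly to the medium class, instead of jumping from one class to the next at a strict limit of 20.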


Defining the Rule System 


One common error in the implementation of scoring
systems based on fuzzy expert systems is neglecting
the effect of the number of indicators describing a
category and the number of fuzzy classes defined by
the expert team.

The same happened during the development of such a
system in the mentioned company. During the early
phases of system development, in the interviewing
process, the expert team often emphasized the
importance of a large number of indicators for
describing a certain category. The consultant
played a key role here, since he had to limit the
expert team to a maximum of four or five key
indicators that best describe a category.

Otherwise, a combinatorial explosion of the number
of rules may occur. The following formula is used
to predict the number of rules (Klepac, 2006):


\[
p_j = \prod_{i=1}^{n} r_i ,
\]

where $p_j$ denotes the number of rules in block
$j$, calculated as the product of the numbers of
fuzzy categories $r_i$ of each $i$-th key indicator,
and $n$ is the number of key indicators in the
block. Based on that, the total number of rules in
a system is calculated using the following formula:

\[
P = \sum_{j=1}^{m} p_j ,
\]

where $P$ denotes the total number of rules in the
system, calculated as the sum of the numbers of
rules in all $m$ rule blocks of the system. Table 4
illustrates the relation between the number of
rules in a block with four key indicators and the
number of fuzzy categories defined within the key
indicators. In the table, we can observe that the
number of rules in a rule block grows as the number
of fuzzy categories increases. In addition, the
number of rules also grows with the number of key
indicators. The total number of rules in a fuzzy
expert system can be controlled by reducing the
number of key indicators or by reducing the number
of fuzzy categories within key indicators. This
methodology helped Trgovina to optimize the number
of expert rules within their system.


Table 4. Illustration of rule number growth

Columns: Number of fuzzy categories of the first
key indicator; Number of fuzzy categories of the
second key indicator; Number of fuzzy categories of
the third key indicator; Number of fuzzy categories
of the fourth key indicator; Number of rules in a
block
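
The growth in the number of rules can be checked with a few lines of Python. The numbers of fuzzy categories used below are hypothetical examples, not the values from Table 4, and the four-block layout is only an assumed structure with three category blocks and one final scoring block.

```python
from math import prod


def rules_in_block(categories_per_indicator):
    """Rules in one block: product of the fuzzy category counts of its indicators."""
    return prod(categories_per_indicator)


def rules_in_system(blocks):
    """Total rules in the system: sum of the rule counts of all rule blocks."""
    return sum(rules_in_block(block) for block in blocks)


print(rules_in_block([3, 3, 3, 3]))        # 81
print(rules_in_block([5, 5, 5, 5]))        # 625
print(rules_in_block([3, 3, 3, 3, 3, 3]))  # 729

# Assumed layout: three category blocks plus one final scoring block.
system = [[3, 3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]
print(rules_in_system(system))             # 162
```

Moving from three to five fuzzy categories per indicator, or from four to six indicators, multiplies the number of rules the experts have to write, which is why the consultant limited both.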

After reaching a consensus regarding the 
number of rules, the experts created the rules for 
the system, using the form in Table 5 depicting 
the key indicator loyalty. 

In this manner, the experts defined the rules for
each category, as well as for the final scoring.
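
How a rule block turns fuzzy inputs into category values can be sketched as follows. The sketch assumes the common min/max inference scheme (minimum for combining conditions, maximum for combining rules with the same conclusion); the chapter does not state which operators were configured in FuzzyTech, and the indicator names and rules are hypothetical.

```python
def evaluate_rule_block(rules, fuzzy_inputs):
    """Fire every rule and return the activation of each output class.

    rules: list of (conditions, output_class), where conditions maps an
           indicator name to the fuzzy class it must belong to.
    fuzzy_inputs: {indicator: {fuzzy_class: degree of membership}}.
    """
    output = {}
    for conditions, output_class in rules:
        # AND of the conditions: the weakest condition limits the rule.
        strength = min(fuzzy_inputs[ind][cls] for ind, cls in conditions.items())
        # Rules with the same conclusion: keep the strongest one.
        output[output_class] = max(output.get(output_class, 0.0), strength)
    return output


# Hypothetical rules for the loyalty category.
loyalty_rules = [
    ({"visit_frequency": "high", "campaign_response": "high"}, "high"),
    ({"visit_frequency": "medium", "campaign_response": "high"}, "medium"),
    ({"visit_frequency": "low", "campaign_response": "low"}, "low"),
]
inputs = {
    "visit_frequency": {"low": 0.4, "medium": 0.6, "high": 0.0},
    "campaign_response": {"low": 0.2, "medium": 0.0, "high": 0.8},
}
loyalty = evaluate_rule_block(loyalty_rules, inputs)
print(loyalty)  # {'high': 0.0, 'medium': 0.6, 'low': 0.2}
```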


Scoring 


Having created the model and performed the data
pre-processing, as well as having defined the
rules, the scoring of pre-processed data followed.
The analysis was performed in three main stages. In
the first stage, scoring was run with the created
fuzzy expert system model and the results were
sampled in order to diagnose errors in the model.
After several cycles of running the data through
the model and checking its reliability, the model
was declared reliable. In the following stage, the
scoring was performed and the results obtained from
the model were accepted. Based on the scoring
results for each region, a relation table (Table 6)
was created.

Table 6. Relation table containing the results of the performed analysis

Table 7. Relative structure of customer shares based on the scoring categories

Further analysis at the level of each region 
was aimed at finding the relative structure of 
customer shares based on scoring. As a result of 
the analysis within one of the regions, the data in 
Table 7 was obtained. 
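
The relative structure of customer shares is a simple frequency calculation over the scoring results. The sketch below uses a hypothetical list of scores for one region; it does not reproduce the figures from Table 7.

```python
from collections import Counter


def score_shares(scores):
    """Return the percentage of customers in each scoring category."""
    counts = Counter(scores)
    total = len(scores)
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


# Hypothetical scoring results for ten customers in one region.
region_scores = ["Very high"] * 2 + ["High"] * 3 + ["Medium"] * 4 + ["Low"]
shares = score_shares(region_scores)
print(shares)
```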

The client structure shows that 49% of customers
were classified in the high and very high scoring
ranges. These customers are of great importance for
the company. Further analyses established that 80%
of the customers belonging to scoring category High
obtained that value based on high profitability and
a bright outlook, but their loyalty is Medium. This
fact placed them in scoring category High instead
of Very high.

From the business point of view, this is a very
important segment of clients, and it is crucial to
significantly increase their loyalty level in the
future in order to successfully cope with the new
situation on the market. Further analysis of the
data using decision trees showed that this category
primarily consists of customers who rarely come to
the store and rarely respond to campaigns that help
them collect a large number of points on loyalty
cards if they buy products advertised in the
campaigns aimed at loyalty card holders. It was
also noticed that this category of customers makes
fewer visits to the store during one month, but the
amounts on their invoices are higher than average
for every purchase. Market basket analysis and
clustering were performed using the data related to
that group of customers. Based on the analyses, two
main customer segments were defined.


“Healthy Life” Segment 


Customers who belong to this segment usually 
buy large amounts of fruit and vegetables, wines 
of better quality and ingredients related to specific 
cuisines (Chinese, Japanese, Indian). They rarely 
buy meat and meat products (even then in small
quantities), but they buy more milk and milk prod- 
ucts than usual. Based on the data from loyalty 
cards, this segment mostly consists of members 
of younger age groups and higher education. 


“Cleanliness is Next to
Godliness” Segment


Customers who belong to this segment usually 
buy home cleaning products and detergents. Some 
70% of articles in their basket are related to the 
mentioned category and the remaining 30% is 
food. The representatives of this segment do not 
exhibit any regularity in buying foodstuffs. Based 
on the data from loyalty cards, this segment mostly 
consists of middle aged persons, who are married 
with children. Further analyses showed that both 
segments mostly shop at the end of the week or 
during the weekends. 

All these data were used in planning of further 
promotional activities in the region. Based on the 
information, a certain number of promotional 
activities were directed towards the mentioned 
market segments. There were more promotional 
activities offering fruit and vegetables at reduced 
prices and an extra discount was given for buying 
certain wines with fruit and vegetables or milk. 
Such purchases were rewarded with more points 
on loyalty cards. Customers who took part in 
a certain number of promotional purchases of 
this type entered a competition for a vacation in 
an exotic destination. A number of promotional 
activities were directed to the second market seg- 
ment, and these included buying home cleaning 
products and detergents at discount prices. This 
type of promotional purchase also included getting 
more points on loyalty cards. Customers who took 
part in a certain number of promotional purchases 
of this type entered a competition for complete 
child-size furniture for a nursery. 

After conducting the described promotional 
activities, 60% of clients who scored as High 
increased their loyalty level and scored as Very 
high during the following five months. Such 
analyses were conducted for each region. Each 
region exhibited its own specificities partially 
discovered after scoring and partially after ad- 
ditional analyses. 


The conducted analysis gains extra value if the
results are observed in different time intervals. 
Based on that, we can obtain information on market 
differentiation and the success of past campaigns, 
the goal of which was to increase the number of 
customers with specified scoring values. This data 
may serve as basis for calculating the amount of 
resources invested per client in order to upgrade his 
scoring value from High to Very high. Monitoring 
of customers in this manner may also serve as a 
tool for monitoring customers’ activities across 
different segments. It can yield very transparent 
information on a potential decrease in customers’ 
activities on the market segment level.

Monitoring the market segments over time 
provides us with a new dimension of analytical 
data overview. General trends of decreasing or 
increasing the volume of market structure provide 
us with guidelines for performing further analytical 
procedures. These analytical procedures should 
answer questions such as why these trends occur, 
what are their causes, how to prevent the trends 
if they are bad for business, or how to strengthen 
them if they are good for business. For example, 
if we observe a 30% increase in the number of 
customers who acquired scoring value “Very 
high” between two quarters, and if their scoring 
value was “High” during the previous period, it 
is necessary to discover the reason that led to the
change in their status. Besides that, it is necessary 
to conduct further analysis and discover potential 
regularities within the population of such custom- 
ers. These insights can be used to motivate other 
customers from “High” scoring category who 
are the most similar to this population to achieve 
the scoring value “Very high” in a certain period 
of time. 
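
Monitoring of this kind amounts to comparing the scoring values of the same customers in two periods. A minimal sketch, assuming that the scores of each quarter are available as a mapping from customer code to scoring value (the data below is hypothetical):

```python
from collections import Counter


def score_transitions(previous, current):
    """Count (old score, new score) pairs for customers scored in both periods."""
    return Counter(
        (previous[customer], current[customer])
        for customer in previous
        if customer in current
    )


# Hypothetical scoring values in two consecutive quarters.
q1 = {"C1": "High", "C2": "High", "C3": "Medium", "C4": "Very high"}
q2 = {"C1": "Very high", "C2": "High", "C3": "High", "C4": "Very high"}

moves = score_transitions(q1, q2)
upgraded = [c for c in q1 if q1[c] == "High" and q2.get(c) == "Very high"]
print(moves[("High", "Very high")])  # 1
print(upgraded)                      # ['C1']
```

The list of upgraded customers is the population whose common characteristics would then be examined with further data mining techniques.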

The insights regarding the regularities within 
a population which can be recognized using data 
mining techniques can be useful in developing a 
strategy for promotional activities with the aim of 
increasing the value of clients from the perspective 
of a sales-oriented company. 


Integration of Scoring Models into
the Existing Information System


The implementation of a scoring model based 
on a fuzzy expert system is mostly performed in 
two main steps. The first step consists of scoring 
based on the basic model. In such a model there 
is no applicative solution based on the designed 
fuzzy expert system, but the analyst performs most
of the scoring based on the created model. This 
step is also characterized by the procedures for 
evaluation of the model, so the model is subject 
to frequent changes in this phase. 

The second step, characterized by the stability of
the created model, is oriented towards finding and
developing a more permanent applicative solution,
which is based on the created model.

The same happened during the implementation of the
model at Trgovina. After a period of intensive
testing and analysis of the scoring, an applicative
solution that could be integrated into the existing
information system was needed. In its final phase,
such a solution had to meet some basic criteria:


• The possibility of viewing scoring results for
  each client, with an explanation of why the
  client was allocated to a certain scoring
  category

• The possibility of running an automated scoring
  procedure over the entire customer database,
  with the possibility of saving the scoring
  values

• The possibility of retrieving historical scoring
  results for each customer.


The integration of the fuzzy model and the 
existing information system was performed with 
the aid of ActiveX objects. Figure 3 depicts a 
part of that solution related to the evaluation of 
customer loyalty. 

Figure 3. Application used for evaluation of customer loyalty

Besides monitoring the individual categories (the
figure illustrates the example of the loyalty
category), the scoring problem is also solved in a
manner that encompasses the values of all
categories, together with the score value. An
applicative solution with such a concept provides
the end user with the derived information that
resulted from the estimation made by the expert
system. Applicative solutions designed in this
manner may be a part of a CRM system, or even a
central module of a CRM system.


DEVELOPMENT OF CUSTOMER 
CHURN ANALYSIS MODEL 


Definition of Customer 
Churn Analysis Strategy in 
the Trgovina Company 


Having performed the scoring of its customers with
respect to the expected activities of its
competition, Trgovina has to establish a system for
permanent monitoring of customer churn and of a
decrease in the purchase activities of existing
customers. Since the market was stable in terms of
the intensity and volume of purchases prior to the
arrival of the competition, the first step in
evaluating possible customer churn was market
research, performed on a representative sample of
customers using telephone interviews before the
competition started its business. This research
roughly indicated that the existing Trgovina
customers would be strongly motivated to defect to
the competition if the competition offered
significantly lower prices. The results of the
research suggest that more than 60% of Trgovina
customers in almost all regions intended to make at
least one purchase in a competitor’s store once it
opened. Although troublesome, this information was
expected, and it was used in developing a strategy
for retaining the existing population of customers.
After they make the initial purchase in the
competitor’s store, the customers must stay loyal
to Trgovina. During the last eight years the market
was stable and the customer churn in Trgovina was
negligible.

Based on data from the company it is very 
difficult, almost impossible, to build a transparent 
prediction model of customer churn, especially 
if the expected market turbulences are taken into 
account. One segment of the strategy aimed at 
retaining the existing customers was to add extra 
points to loyalty cards in accordance with their 
scoring category and to inform the customers of 
the number of points based on which they can buy 
products at cheaper prices. The points were added 
based on the scoring category. For example, the
customers labelled “Very high” were given more
points than the customers with the “High” scoring
category. The final goal is to keep as many
existing customers as possible, especially those
labelled as desirable for the company. As opposed to e.g. the
telecommunication sector, where customer churn 
is clearly defined through the moment of breaking 
a contract, the moment of ending cooperation in 
retail is not clearly defined. Thus the expert team 
had to define the moment of customer churn in 
retail. 

Based on the market research results, the
management expected a decrease in turnover and in
the number of customers in stores during the first
month after the opening of competitor retail stores
in the regions where they were opened. This was
partially due to the curiosity of customers and
partially to the aggressive campaign of the
competition. Customer churn in Trgovina was defined
for each region as the absence of an existing
customer from the store within two months after a
competitor retail store was opened in the region,
or a decrease in an existing customer’s monthly
purchase amount by an average of 60% within two
months in comparison to the previous three months.
It was decided that the customer churn analysis
would be conducted on a monthly basis even before
the end of the two-month period, with the aim of
predicting customer churn in the future period.
After two months, a comprehensive analysis of
customer churn would be conducted, which should
provide the guidelines for further strategic
planning. Based on these analyses and the
competition’s moves, precise actions aimed at
retaining customers would be defined.
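
The churn definition above can be expressed as a small function. This is a sketch under stated assumptions: monthly purchase totals are available as a list, the average of the two months after the opening is compared with the average of the three months before it, and a drop of exactly 60% already counts as churn. The function name and the sample figures are hypothetical.

```python
def is_churned(monthly_spend, opening_index, drop_threshold=0.60):
    """Apply the two-part churn definition to one customer.

    monthly_spend: list of monthly purchase totals, oldest first.
    opening_index: index of the first month after the competitor opened.
    """
    if opening_index < 3 or len(monthly_spend) < opening_index + 2:
        raise ValueError("need three months before and two months after the opening")

    before = monthly_spend[opening_index - 3:opening_index]
    after = monthly_spend[opening_index:opening_index + 2]

    if sum(after) == 0:
        return True   # the customer was absent for two months
    baseline = sum(before) / 3
    if baseline == 0:
        return False  # no earlier purchases to compare against
    drop = 1 - (sum(after) / 2) / baseline
    return drop >= drop_threshold


print(is_churned([100, 100, 100, 0, 0], opening_index=3))    # True
print(is_churned([100, 100, 100, 30, 40], opening_index=3))  # True
print(is_churned([100, 100, 100, 60, 50], opening_index=3))  # False
```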

In order to achieve market advantage over the 
competition even before they enter the market, the 
plan is to increase the loyalty of existing custom- 
ers by adding points to their loyalty cards so that 
they can buy products at discount prices. As a
precaution, a couple of weeks before the competition
opens its stores, promotional campaigns based on
price reductions for some products will be
intensified, as will advertising in local media.
From the perspective of Trgovina,
the fact that the competition will not open stores 
in all regions simultaneously, but within intervals 
of several months is good news. This will make 
the development of the strategy for retaining 
customers in regions where the competition will 
open stores later on easier, based on the strategic 
patterns discovered in other regions. 


Preprocessing of the Data 
for Survival Models 


In order to analyze customer churn in Trgovina,
Cox regression was used (Berry, 2003). The
advantage of this method is the possibility of
including predictive variables (covariates) in the
model. The goal of the analysis was to discover not
only the trend of the customer churn curve, but
also to estimate the probability of customer churn
with regard to different customer characteristics,
which is possible using this model. It is important
to notice that customers can begin and end their
cooperation lifecycle with Trgovina during
different periods of time, as illustrated in
Figure 4.

Figure 4. Different periods of beginning and ending the business cooperation

For the purpose of the evaluation of customer 
churn, it is necessary to define discrete time in- 
tervals (days, months, years...). The basic idea 
of a survival model boils down to the estimation 
of probability that someone who has “survived” 
as a Trgovina customer during a certain period of 
time will either stop or continue purchasing dur- 
ing the following period. Cox regression answered 
the question regarding the probability of cus- 
tomer churn with respect to their different attri- 
butes. 
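
For orientation, the standard form of the Cox proportional hazards model is given below. It is the general textbook formulation rather than the specific model fitted for Trgovina, and the symbols $h_0$, $\beta_k$ and $x_k$ are notation introduced here.

```latex
h(t \mid x_1, \dots, x_p) = h_0(t)\, \exp\left(\beta_1 x_1 + \dots + \beta_p x_p\right)
```

Here $h(t \mid x_1, \dots, x_p)$ is the hazard of churn at time $t$ for a customer with attributes $x_1, \dots, x_p$, $h_0(t)$ is the baseline hazard shared by all customers, and $\exp(\beta_k)$ is the factor by which the hazard changes when attribute $x_k$ increases by one unit.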

For the purpose of creating the model, it is 
important to define the notion of customer churn as 
absence of an existing customer during the last two
months after the competition opened a retail store 
in the region or if an existing customer decreased 
the amount of monthly purchase by an average 
of 60% within two months in comparison to the 
previous three months. Based on the definition, a 
pre-processing of the data for the status variable 
was performed. A status designation of 1 denotes 
a churned customer, while a status designation of 
0 denotes an existing customer. 

Another variable important for the model is the
number of months of continuous purchase. This
variable is defined as the number of months elapsed
between the date of issuing the loyalty card and
the date of customer churn. Having in mind the
nature of Cox regression, the model includes other
attributes (predictive variables). Continuous
variables, such as the age variable, are made
discrete. Data pre-processing resulted in a table
with the structure shown in Table 8.
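
A row of such a table can be derived as sketched below. The handling of customers who have not churned (their duration is counted up to the end of the observation period) is an assumption made here, as are the function names, customer codes and dates.

```python
from datetime import date


def months_between(start, end):
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def survival_row(customer_code, card_issued, churn_date, observation_end):
    """Build the duration and status columns for one customer.

    status 1 = churned customer, status 0 = existing customer.
    """
    status = 1 if churn_date is not None else 0
    end = churn_date if churn_date is not None else observation_end
    return {
        "customer_code": customer_code,
        "purchasing_months": months_between(card_issued, end),
        "status": status,
    }


# Hypothetical customers.
churned = survival_row("A", date(2020, 1, 15), date(2021, 3, 1), date(2021, 12, 31))
active = survival_row("B", date(2020, 6, 1), None, date(2021, 12, 31))
print(churned)
print(active)
```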

Based on such a table, it is possible to perform
the analysis using Cox regression. 
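
Fitting a Cox regression requires a statistical package, so the sketch below shows something simpler: a Kaplan-Meier estimate of the survival curve, computed from the same two columns (number of purchasing months and status). It is not the Cox model used in the project and it ignores the predictive variables; the sample data is hypothetical.

```python
def kaplan_meier(durations, statuses):
    """Return [(month, survival probability)] for each month with a churn event.

    durations: number of purchasing months per customer.
    statuses: 1 = churned, 0 = still a customer (censored observation).
    """
    survival = 1.0
    curve = []
    event_months = sorted({d for d, s in zip(durations, statuses) if s == 1})
    for month in event_months:
        at_risk = sum(1 for d in durations if d >= month)
        churned = sum(
            1 for d, s in zip(durations, statuses) if d == month and s == 1
        )
        survival *= 1 - churned / at_risk
        curve.append((month, survival))
    return curve


# Hypothetical customers: months observed and churn status.
durations = [5, 8, 8, 12, 16, 20, 20, 20]
statuses = [1, 1, 0, 1, 1, 0, 0, 0]

curve = kaplan_meier(durations, statuses)
for month, probability in curve:
    print(month, round(probability, 3))
```

Customers with status 0 still count as being at risk up to their last observed month, which is how the estimate makes use of customers who have not churned.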


Performing the Analysis 


Unlike businesses such as telecommunications, 
retail business is characterized by a range of 
specificities, which guide the data adjustment in 
order to build a customer churn model. Besides the 
specificities related to the definition of the moment 
of customer churn, retail business does not have 
the privilege of receiving transparent information 
on a daily basis regarding the relationship of a 
customer and the company defined by a contract. 
Due to that reason, it is much easier to monitor 
the client structure in telecommunications with 
regard to breaking the contracts. 

If we observe the situation from the competition’s
point of view for a moment, we can notice that they
lack usable data for performing a deeper analysis
of the structure of clients who decided to make a
purchase in their store during the first month,
since they will not issue loyalty cards during the
first couple of months, and given the specificities
of retail business.

Table 8. Structure of the data for performing Cox regression

Columns: Customer code; Number of purchasing
months; Gender; No of complaints
Customer codes: 23345342, 54336564, 34566334

At the moment of signing the contract,
telecommunication companies obtain much more
information from the client, which may be used for
analysis. By contrast, a client of a retail company
provides only data on past transactions, so he
cannot be uniquely identified. From this point of
view, Trgovina has a certain analytical advantage,
because it owns data acquired from loyalty cards
that are more suitable for analysis, which is
important for analyzing customer churn. When
entering a market, telecommunication companies and
similar businesses that base their operation on
contracts signed with clients count on a certain
percentage of customer churn, which is manifested
in the form of broken contracts. Such companies
have no problem analyzing the reasons for churn and
the profiles of clients who churn based on the
collected data.

From the point of view of a competing retail company, the analogue of
customer churn is the arrival of a customer who, motivated by an
aggressive advertising campaign, makes a purchase during the first month
after acquisition and then gives up purchasing in the competitor’s
store. It is very difficult to uniquely identify a group of such
customers. This fact almost prevents the competing company from
performing analytical procedures aimed at influencing those trends and
motivating this group of customers to continue purchasing. In such
cases, the competing company has to rely on market research results
obtained through surveys on samples of customers, which are a rather
big investment.

Figure 5. Survival curve for the Trgovina company customers

Trgovina performed the analysis of customer churn on the pre-processed
data using Cox regression. Because the market had been stable during
the previous period, the analysis focused on the last twenty months of
operations. The competition started operating in the region in the
sixteenth month observed within the model. After the analysis was
performed, the survival curve for Trgovina customers was obtained, as
illustrated in Figure 5.
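
To make the notion of a survival curve concrete, the following minimal sketch computes a Kaplan-Meier estimate, which is a simpler non-parametric estimator and not the Cox regression used in the case study. The sample records (months of continuous purchase and a churn flag) and the function name `kaplan_meier` are hypothetical and serve illustration only.

```python
from collections import Counter

def kaplan_meier(durations, churn_flags):
    """Return [(month, survival probability)] for every month with a churn event."""
    events = Counter(t for t, c in zip(durations, churn_flags) if c)
    exits = Counter(durations)      # churned or censored customers leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the share of at-risk customers who did not churn at month t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical data: months of continuous purchase and whether the customer churned
months = [16, 17, 17, 19, 20, 20, 20, 20, 20, 20]
churned = [True, True, False, True, True, False, False, False, False, False]

for month, prob in kaplan_meier(months, churned):
    print(month, round(prob, 3))
```

Customers who have not churned by the end of the observation window are treated as censored: they leave the risk set without lowering the survival probability.
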
The figure clearly indicates that, in a relatively stable market, the
Trgovina customer population decreases after the sixteenth observed
month. Translated to model jargon, the probability of survival of the
customer population is decreasing. The curve shows that a decrease in
the promotional activities of the competition had an influence on the
decrease of customer churn in Trgovina (seventeenth and eighteenth month
in the model in Figure 5). With the intensification of promotional
activities, the trend again started to shift in favour of the
competition. After the twentieth month, using the defined notion of
customer churn, Trgovina had lost 15% of the existing customer
population who were loyalty card holders in the region observed.

The results of the analysis provide us with guidelines regarding the
structure and dynamics of the churn, but they are not sufficient for
reaching firm conclusions which might be used in the process of decision
making. For that purpose, predictive variables which will provide us
with a clearer picture of customer churn in Trgovina need to be included
in the model. Predictive variables are included in the table made on the
basis of data pre-processing. These variables were selected based on
attribute relevance analysis: the variables shown, on the basis of the
calculated Gini index, to best describe the churn status variable were
included in the data model which was further analyzed.
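
The chapter does not state the exact form of the Gini-based relevance measure, so the following is only a sketch of one common variant: the reduction in Gini impurity of the churn flag obtained by splitting customers on a candidate attribute. The attribute names and the small data set are hypothetical.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(values, labels):
    """Impurity reduction achieved by grouping labels on a categorical attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    n = len(labels)
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted

# Hypothetical customers: 1 = churned, 0 = retained
churn = [1, 1, 1, 0, 0, 0, 0, 0]
age_group = ["young", "young", "young", "young", "old", "old", "old", "old"]
gender = ["m", "f", "m", "f", "m", "f", "m", "f"]

for name, column in [("age_group", age_group), ("gender", gender)]:
    print(name, gini_gain(column, churn))
```

Under this measure an attribute with a larger gain separates churners from non-churners better; in the toy data the age group is far more relevant than gender.
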

After the analysis was performed on the 
pre-processed data using Cox regression, it was 
found that certain categories of age structure and 
customers with a defined number of complaints 
filed through the call centre have the biggest 
influence on churn trends. It was also noticed 
that there are no significant differences in churn 
trends among customers classified in different 
scoring categories. 

The results of the analysis are shown in Figure 
6 and Figure 7. 
Figure 6. Survival function with regard to the “complaints” variable
Figure 7. Survival function with regard to the “age” variable
If we pay closer attention to the graphs and take into account that the
competition’s advertising campaigns were less intense during the
seventeenth and eighteenth month observed, we can notice that during
this period customer churn decreased in comparison to the following
periods, and that it increased in the following months observed. It can
also be noticed that the highest probability of churn is connected with
younger customers, as well as customers who filed more than four
complaints through the call centre. Based on the analysis, it was
concluded that the risk of customer churn grows with younger age groups
as well as with the number of complaints via the call centre.

The probable reason for more frequent churn among the younger population
is that the competition’s advertising campaign targets exactly that
group, which is their strategic market segment. Regarding the more
frequent customer churn among clients who filed more than four
complaints through the call centre, further analysis was conducted on
that customer population with the aim of finding common attributes of
customers belonging to it. The analysis did not yield results that would
enable recognizing the dominant significant attributes in that
population.


Business Decisions Based 
on the Analysis 


The risky categories diagnosed for the observed region, which exhibit a
greater tendency towards customer churn, are younger customers and
customers who had filed more than four complaints in the call centre.
Generally speaking, the risk of customer churn grows with diminishing
customer age and with an increasing number of complaints. In response to
the competition’s campaign aimed at younger customers, Trgovina decided
to pay special attention to that market segment through its advertising
campaigns in order to prevent as much churn as possible among younger
customers. Special attention will be paid to younger customers who were
placed, based on the performed scoring, in the “High” and “Very high”
scoring categories. These groups will be targeted by special promotional
activities in order to additionally motivate them to continue purchasing
in Trgovina stores.

Regarding the discovered regularities related to customer complaints,
the decision was made to further analyze the reasons for dissatisfaction
and to contact each client who had filed more than two complaints
through the call centre with the intent of resolving them. This is
another strategy for increasing client loyalty which will also
contribute to the reduction of customer churn.


FUTURE TRENDS IN CHURN 
PREDICTION AND CONCLUSIONS 


A relatively stable market takes on a significantly different character
after the arrival of competition in certain regions. Trgovina discovered
the structure of its customers according to their value to the company
using scoring techniques. This knowledge resulted in focusing on the
most valuable customers from the perspective of the company, with the
purpose of retaining them and increasing their loyalty. Based on the
scoring results, an enhanced policy of discount prices was developed for
purchasing products from recognized categories obtained during the
scoring procedure. The period immediately before the arrival of the
competition was used for conducting activities for increasing client
loyalty with the aim of long-term retention. Trgovina was aware of the
fact that it would inevitably lose a portion of its customers, so its
goal was to diminish those trends as much as possible. Although the
final goal was to keep as many existing customers as possible, the main
emphasis was on the customers who were most valuable for the company.

A further step was aimed at the creation of 
an early warning system in the shape of a cus- 
tomer churn analysis model, based on which it 
is possible to recognize basic regularities within 
the population of customers who either churn or 
decrease the intensity of purchase according to 
the specified criteria. 

The analyses conducted and the business decisions made based on those
analyses resulted in a significant reduction of customer churn,
especially in the younger customers segment, which was targeted by the
competition. Trends of customer churn related to customers with a higher
number of complaints were also diminished with the help of call centre
operators who contacted customers who had filed complaints, trying to
solve their problems and alleviate the consequences related to the
complaints. In the long run, this approach resulted in an increase in
customer satisfaction and consequently in growth of sales revenue.

Trgovina managed to keep a satisfactory num- 
ber of customers who were ranked high in the 
scoring procedure, which was one of the goals. 
Regarding the new market conditions of fierce 
competition, Trgovina had to constantly monitor 
the development of the market and analyze market 
trends on a global and regional level, especially 
the population of its customers. Mandatory scor- 
ing on a monthly basis initiated the analysis of 
market segments differentiation and its causes on 
the level of the Croatian market and on regional 
levels. Such an approach enabled establishing an 
early warning system which is of vital importance 
in conditions where the competition continually 
undertakes targeted promotional actions with the 
aim of winning clients. The system of conducted 
analyses proved to be very effective concerning 
the situation on the market and it served as a basis 
for the development of further analytical systems 
that contributed to the success of Trgovina deci- 
sion support at all management levels. 

The presented case represents one of the 
possible solutions in given circumstances. The 
developed model shows good performance in 
practice. Churn is specific to a given area and 
there is no cookbook data mining solution which 
could be applied in each case (Berry, 2000; Berry, 
2003; Giudici, 2003; Namid, 2004). The existing
model could be extended with an early warning
system and a segmentation model which takes into
consideration the prospective value of a customer
as part of a customer relationship model.

Churn analyses will certainly use more than a few common data mining
models. As market conditions become more complicated, it becomes
necessary to combine a variety of data mining techniques to achieve
better results. Customer relationship management systems will play a
more important role in churn prediction, as churn prevention systems.
Recognizing customer needs and behavior is the first and most important
step in churn prevention.


REFERENCES 


Agosta, L. (2000). The Essential Guide to Data
Warehousing. Upper Saddle River, NJ: Prentice
Hall.


Aleksander, I., & Morton, H. (1995). An introduction
to neural computing. New York: International
Thompson Computer Press.


Aracil, J., & Gordillo, F. (Eds.). (2000). Stability
Issues in Fuzzy Control. Heidelberg, Germany:
Physica-Verlag.


Berry, J. A., & Linoff, G. (1997). Data mining
techniques for marketing sales and customer support.
New York: John Wiley & Sons Inc.


Berry, J. A., & Linoff, G. (2000). Mastering data
mining. New York: John Wiley & Sons Inc.


Berry, J. A., & Linoff, G. (2003). Mining the web.
New York: John Wiley & Sons Inc.


Dresner, H. (2008). Performance management
revolution. New York: John Wiley & Sons Inc.


Faulkner, M. (2003). Customer management
excellence. New York: John Wiley & Sons Inc.


Giudici, P. (2003). Applied Data Mining: Statistical
Methods for Business and Industry. New York:
John Wiley & Sons Inc.


Hampel, R., Wagenknecht, M., & Chaker, N.
(Eds.). (2000). Fuzzy Control: Theory and Practice.
Heidelberg, Germany: Physica-Verlag.


Han, J., & Kamber, M. (2000). Data Mining:
Concepts and Techniques. San Francisco: Morgan
Kaufmann.




Klepac, G., & Mršić, L. (2006). Poslovna inteli- 
gencija kroz poslovne slučajeve. Zagreb, Croatia: 
Liderpress. 


Klepac, G., & Panian, Ž. (2003). Poslovna inteli- 
gencija. Zagreb, Croatia: Masmedia. 


Kohonen, T. (2001). Self-organizing maps. Berlin: 
Springer. 


Larose, D. T. (2005). Discovering Knowledge in
Data: An Introduction to Data Mining. New York:
John Wiley & Sons Inc.


Larose, D. T. (2006). Data mining methods and
models. New York: John Wiley & Sons Inc.


Mannila, H., & Hand, D. (2001). Principles of 
Data Mining. Cambridge, MA: The MIT press. 


Namid, R. N., & Christopher, D. B. (Eds.). (2004). 
Organizational Data Mining: Leveraging Enter- 
prise Data Resources for Optimal Performance. 
Hershey, PA: Idea Group. 


Pedrycz, W., & Gomide, F. (1998). An Introduction 
to Fuzzy Sets: Analysis and Design of Complex 
Adaptive Systems. Cambridge, MA: MIT Press. 


Pyle, D. (1999). Data preparation for Data Min- 
ing. San Francisco: Morgan Kaufmann. 


Siler, W., & Buckley, J. J. (2005). Fuzzy expert
systems and fuzzy reasoning. New York: John
Wiley & Sons Inc.


Vose, D. (2000). Quantitative Risk Analysis. New 
York: John Wiley & Sons Inc. 


KEY TERMS AND DEFINITIONS 


Data Mining: Discovering hidden useful
knowledge in large amounts of data (databases).

Fuzzy Logic: Logic which allows membership
of more than one category, each with a degree of
membership, as opposed to (exact) crisp logic.

Fuzzy Expert System: Expert system based
on fuzzy logic.

Scoring: Process of assigning some value
(usually numeric) as a grade to represent the
performance of an observed case/object.

Churn: Interruption of a contract, or of the use
of products or services.

Survival Analysis: Analysis which shows the
survival rate (for example, of a population of
customers) over a defined period of time.

Cox Regression: One of the methods for
survival analysis.
Section 4
Data Mining as Applications
and Approaches Related to
Organizational Scene

An Exposition of CaRBS
Investigating Intra Organization
The non-trivial extraction of implicit, previously unknown, interesting, and potentially useful information
is at the heart of efforts to solve real-world problems; perhaps nowhere more so than in the field
of organization studies. This chapter aims to describe the ability of a nascent data mining technique,
Classification and Ranking Belief Simplex (CaRBS), to undertake analysis in the area of organization
research in the public sector. The rudiments of CaRBS, and the RCaRBS development also employed,
are based on the general methodology of Dempster-Shafer theory (DST); as such, the data mining
analysis undertaken with CaRBS is associated with uncertain modelling. Throughout this chapter, a real
application is considered, namely, using survey data drawn from a large multipurpose public organization,
to examine the argument that consensus on strategic priorities is, at least partly, determined by an
organization’s structure, process and environment.


INTRODUCTION 


Deriving predictions from hidden patterns amongst 
large amounts of data is the cornerstone of data 
mining (Chen, 2001). The non-trivial extraction 
of implicit, previously unknown, interesting, and 
potentially useful information is at the heart of
efforts to solve real-world problems (see Berry and
Linoff, 1997; Westphal and Blaxton, 1998); perhaps
nowhere more so than in the field of organization 
studies. This chapter aims to describe the abil- 
ity of a nascent data mining technique, based on 
uncertain modelling, to undertake analysis in the 
area of organization research in the public sector. 
Further, the chapter also demonstrates how such a 
technique itself can be developed to perform more 
pertinent analysis in this area. 


Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 


The Classification and Ranking Belief Simplex 
(CaRBS) non-parametric technique, introduced in 
Beynon (2005a, 2005b), was presented as a novel 
approach to undertake data mining. The rudiments 
of CaRBS are based on the general methodology 
of Dempster-Shafer theory (DST), introduced in 
Dempster (1967) and Shafer (1976). As such, the 
data mining analysis undertaken with CaRBS is 
associated with uncertain modelling. Indeed, DST 
is considered one of the three key mathematical 
approaches to uncertainty modeling (Roesmer, 
2000), along with the probabilistic and fuzzy logic 
approaches (see Mantores, 1990; Zadeh, 1975; 
Yang et al., 2006). Further, it is often described 
as a generalisation of the well-known Bayesian 
theory (Shafer and Srivastava, 1990). 

One consequence of the association of the 
non-parametric technique CaRBS with DST, is the 
ability to undertake analysis in the presence of a 
form of mathematical based ignorance (Safranek
et al., 1990; Beynon, 2005b). The original CaRBS
technique is employed in the classification-type 
analysis of strategic consensus in a public orga- 
nization (see later), plus a development, termed 
RCaRBS, which affords the ability to undertake 
regression-type analysis on the same problem. 
The RCaRBS analysis presented here illustrates, 
at the technical level, how a data mining technique 
based on uncertain modelling, such as CaRBS, 
can be developed to undertake more general types 
of analysis pertinent to the types of continuous 
data generally used by organizational researchers 
(using RCaRBS). Indeed, uncertain modelling is 
uniquely able to accommodate the ambiguity that 
surrounds the subjective measures of organizational
characteristics that are often used in studies
of strategic management (see Dutton et al., 1983).

Throughout this chapter, a real application is considered, namely,
using survey data drawn from a large multipurpose public organization,
to examine the argument that consensus on strategic priorities is, at
least partly, determined by an organization’s structure, process and
environment (Dess and Origer, 1987). This is a pertinent application,
since management theory suggests that strategic consensus has important
implications for organizational performance (Bourgeois, 1987). Despite
widespread speculation on the veracity of these propositions on
consensus (see Bowman and Ambrosini, 1997), few studies have
systematically examined the antecedents of strategic consensus in
public, private or non-profit organizations (see Kellermanns et al.,
2005). Moreover, there has been a relative dearth of studies employing
non-parametric techniques, such as the well-known neural networks (and
the CaRBS here in particular), in research on this and related issues
in organizational research (see for example DeTienne et al., 2003).

Amongst the technical expositions given in this chapter, the binary
classification based data mining presented, using the original CaRBS
technique, is operationalised in terms of a constrained optimisation
problem. This problem is solved here using the evolutionary computation
technique Trigonometric Differential Evolution (TDE - Fan and Lampinen,
2003), which employs an objective function which confers the
minimisation of ambiguity, in the classification of vertical consensus
based on managers’ perceptions of strategic priorities, but not
concomitant ignorance (Beynon, 2005b). The second analysis exposited,
using the RCaRBS development of CaRBS, demonstrates regression-type
analysis in the presence of ignorance (again using TDE and an objective
function based on the minimisation of the level of predictive fit ‘sum
of squares error’ of the degree to which vertical consensus exists on
perceived strategic priorities). This latter analysis is pertinent since
the majority of quantitative organizational research is regression
oriented (DeTienne et al., 2003).

Throughout the analysis presented in this chapter, there is emphasis on
the graphical representation of results, primarily using the simplex
plot method of data representation, an intrinsic part of the CaRBS
technique (and RCaRBS), explicitly referred to in its introduction (see
Beynon, 2005a). The intention of the chapter is to exposit the workings
and potential application of a data mining technique based on uncertain
modeling (CaRBS and RCaRBS) for organizational research, in this case
through the general methodology of DST. With respect to the considered
intra-organizational consensus problem, how the CaRBS based data mining
modelling of strategic consensus fits with the considered organization
characteristics investigated is of theoretical and practical interest,
with the graphical inferences produced offering a novel clear
perspective on how organizations can adapt their internal
characteristics to increase levels of strategic consensus.

The structure of the rest of the book chapter is 
as follows: The background section discusses the 
general methodology of DST, including a small 
example, followed by an exposition of the CaRBS 
technique and its development with RCaRBS. The 
main thrust of the chapter describes the strategic 
consensus problem considered, before present- 
ing the CaRBS and RCaRBS analyses. Finally, 
conclusions are drawn and the implications of the 
content of the chapter are discussed. 


BACKGROUND 


The background discussed here surrounds the 
concomitant technical issues, namely an exposi- 
tion of the CaRBS technique (and the associated 
RCaRBS development), and before it, the gen- 
eral methodology of Dempster-Shafer theory is 
described, upon which the CaRBS technique is 
grounded. 


Dempster-Shafer Theory 


The methodology underwriting the technique described in this section is
Dempster-Shafer theory (DST), introduced in Dempster (1967) and Shafer
(1976), and generally acknowledged to be a mathematical approach to
uncertainty modelling (Roesmer, 2000). As a technique viewed in terms of
probabilistic reasoning, DST is also considered one of the fundamental
methodologies making up the notion of soft computing (ibid.).

Fundamentally, DST is based on the idea of obtaining degrees of belief
for one question (the equivalent of a dependent variable), from
subjective probabilities describing the evidence from others (the
equivalent of independent variables), and that concordant pieces of
evidence reinforce each other. This evidential reasoning methodology, it
has been argued, is a generalization of the well-known Bayesian
probability calculus (Shafer and Srivastava, 1990; Schubert, 1994); see
also Dempster (2008) for a contemporary reflection on DST.

With DST a general methodology, its fundamentals consider a finite set
of $p$ hypotheses $\Theta = \{o_1, o_2, \ldots, o_p\}$, called a frame
of discernment. A mass value is a function $m: 2^\Theta \to [0, 1]$ such
that $m(\emptyset) = 0$ ($\emptyset$ being the empty set) and
$\sum_{s \in 2^\Theta} m(s) = 1$ ($2^\Theta$ being the power set of
$\Theta$). Any proper subset $s$ of the frame of discernment $\Theta$
for which $m(s)$ is non-zero is called a focal element, and the $m(s)$
value represents the exact belief in the proposition depicted by $s$.
The collection of mass values (and focal elements) associated with a
single piece of evidence is called a body of evidence (BOE). The mass
value $m(\Theta)$ assigned to the frame of discernment $\Theta$ is
considered the amount of mathematical ignorance within the BOE, since it
represents the level of exact belief that cannot be discerned to any
proper subsets of $\Theta$.

DST also provides a method to combine the BOEs from different pieces of
evidence, using Dempster’s rule of combination. This rule assumes these
pieces of evidence are independent; then the function
$[m_1 \oplus m_2]: 2^\Theta \to [0, 1]$, acting on two BOEs, is defined
by (on a focal element $s$);

$$[m_1 \oplus m_2](s) = \begin{cases} 0, & s = \emptyset \\ \dfrac{\sum_{s_1 \cap s_2 = s} m_1(s_1)\, m_2(s_2)}{1 - \sum_{s_1 \cap s_2 = \emptyset} m_1(s_1)\, m_2(s_2)}, & s \neq \emptyset \end{cases}$$

and is a mass value, where $s_1$ and $s_2$ are focal elements from the
BOEs, $m_1(\cdot)$ and $m_2(\cdot)$, respectively.

The combination rule can be considered over all the elements in the
power set of $\Theta$ ($s \in 2^\Theta$), to formulate the new BOE.
Further, the combination rule can be used iteratively to combine the
evidence contained in a number of BOEs. The denominator part of the
combination rule includes
$\sum_{s_1 \cap s_2 = \emptyset} m_1(s_1)\, m_2(s_2)$, considered to
measure the level of conflict in the combination process between BOEs
(Murphy, 2000), and is based on the sum of the products of mass values
associated with focal elements from the different BOEs which have empty
intersection.

To clarify a reader’s understanding of the fundamentals of DST, a small
example is next presented. Commonly called the “assassins problem”, it
has previously been presented to elucidate the fundamentals of DST and
its own development (see for example, Smets, 1990). Let us say there are
three individuals (assassins), Henry, Tom and Sarah, who are suspects
for the murder of Mr. White. Within DST, these three suspects make up a
frame of discernment, $\Theta = \{\text{Henry, Tom, Sarah}\}$. Two
witnesses (W1 and W2) have information regarding the murder of Mr.
White:


Witness W1: is 80% sure that the murderer was a man.

Witness W2: is 60% confident that Henry was leaving on a jet plane when
the murder occurred.


Each of these pieces of evidence (information) is converted, in DST,
into a concomitant BOE, defined $m_{W1}(\cdot)$ and $m_{W2}(\cdot)$.
Witness W1’s evidence furnishes belief on the murderer being a man,
which pertains specifically to Henry and Tom (the male suspects). Thus,
in the respective BOE $m_{W1}(\cdot)$, the focal element
$\{\text{Henry, Tom}\}$ exists, with associated mass value 0.8, namely
$m_{W1}(\{\text{Henry, Tom}\}) = 0.8$. This mass value is assigned to
the set $\{\text{Henry, Tom}\}$ and not the individual elements in the
set; indeed, central to DST is that the distribution of this mass value
amongst the elements of such a set is unknown (Srivastava and Liu,
2003). Since there is no information regarding the remaining mass value
($1.0 - 0.8 = 0.2$), it is considered ignorance (mathematical), and
allocated to $\Theta$ (all the suspected assassins), hence
$m_{W1}(\{\text{Henry, Tom, Sarah}\}) = 0.2$ ($= m_{W1}(\Theta)$).

Following a similar argument, the BOE constructed from the evidence of
witness W2, $m_{W2}(\cdot)$, is made up of the two focal elements and
mass values, $m_{W2}(\{\text{Tom, Sarah}\}) = 0.6$ and
$m_{W2}(\{\text{Henry, Tom, Sarah}\}) = 0.4$. Within the two BOEs,
$m_{W1}(\cdot)$ and $m_{W2}(\cdot)$, each mass value represents the
exact belief in that focal element (of suspects) including the murderer
of Mr. White.

Having established the mathematical evidence from the two witnesses (two
sources of information), its combination (aggregation) is next carried
out using Dempster’s combination rule, presuming the witnesses are
giving independent evidence. At the technical level, the combination
process is based on the intersection and multiplication of the focal
elements and mass values from the previously constructed BOEs,
$m_{W1}(\cdot)$ and $m_{W2}(\cdot)$; see Table 1 for intermediate
findings of the combination process.
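
The intersection-and-multiplication process can be sketched in a few lines of Python. This is an illustrative implementation of Dempster’s rule applied to the two witness BOEs; the dictionary-of-frozensets representation and the names `combine`, `m_w1`, `m_w2` and `m_c` are choices made for this sketch rather than part of the CaRBS literature.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two BOEs given as {frozenset: mass value}."""
    raw = {}
    conflict = 0.0
    for (s1, v1), (s2, v2) in product(m1.items(), m2.items()):
        intersection = s1 & s2
        if intersection:
            raw[intersection] = raw.get(intersection, 0.0) + v1 * v2
        else:
            conflict += v1 * v2   # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence cannot be combined")
    # Normalise by one minus the level of conflict
    return {s: v / (1.0 - conflict) for s, v in raw.items()}, conflict

theta = frozenset({"Henry", "Tom", "Sarah"})
m_w1 = {frozenset({"Henry", "Tom"}): 0.8, theta: 0.2}
m_w2 = {frozenset({"Tom", "Sarah"}): 0.6, theta: 0.4}

m_c, conflict = combine(m_w1, m_w2)
for focal, mass in sorted(m_c.items(), key=lambda item: -item[1]):
    print(sorted(focal), round(mass, 2))
print("conflict:", round(conflict, 2))
```

Because every pairwise intersection is non-empty in this example, the level of conflict is zero and the normalisation leaves the products unchanged.
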

Table 1. Intermediate findings from the combination of the BOEs, $m_{W1}(\cdot)$ and $m_{W2}(\cdot)$

$m_{W1}(\cdot)$ \ $m_{W2}(\cdot)$    {Tom, Sarah}, 0.6      $\Theta$, 0.4
{Henry, Tom}, 0.8                    {Tom}, 0.48            {Henry, Tom}, 0.32
$\Theta$, 0.2                        {Tom, Sarah}, 0.12     $\Theta$, 0.08

In Table 1, the intersection and multiplication of the focal elements
and mass values included in the BOEs, $m_{W1}(\cdot)$ (first column) and
$m_{W2}(\cdot)$ (first row), are presented (bottom right hand of table).
Amongst the findings, the new focal elements found are all non-empty; it
follows that the level of conflict
$\sum_{s_1 \cap s_2 = \emptyset} m_{W1}(s_1)\, m_{W2}(s_2) = 0$ (part of
the denominator of the combination rule, see previously), so the
resultant BOE, defined $m_C(\cdot)$, can be taken directly from the
results in Table 1 (since the denominator part of the combination rule
equals 1), namely;


$m_C(\{\text{Tom}\}) = 0.48$, $m_C(\{\text{Henry, Tom}\}) = 0.32$,
$m_C(\{\text{Tom, Sarah}\}) = 0.12$ and
$m_C(\{\text{Henry, Tom, Sarah}\}) = 0.08$.


Amongst this combination of evidence, in the BOE $m_C(\cdot)$, the mass
value assigned to ignorance
($m_C(\Theta) = m_C(\{\text{Henry, Tom, Sarah}\}) = 0.08$) is less than
that present in the original individual witness based BOEs
($m_{W1}(\cdot)$ and $m_{W2}(\cdot)$), as expected when combining
evidence using DST. In summary, the combined evidence, in terms of mass
values, is spread over a number of focal elements of suspects (more
focal elements than present in any of the individual witness BOEs). The
newly formed BOE $m_C(\cdot)$ could then be used to exposit the evidence
supporting the individual suspects’ association with being the murderer
of Mr. White (see later in the context of the CaRBS based analyses
undertaken).


The Classification and Ranking
Belief Simplex (CaRBS) and RCaRBS


The CaRBS technique was originally devised as a tool to undertake the
binary classification and ranking of objects in the presence of
ignorance (see Beynon, 2005a). Alongside its description (and subsequent
employment), here it is developed to perform regression-type analysis
(termed RCaRBS).


The technical details of the CaRBS technique 
are next briefly described (see Beynon, 2005a; 
2005b, for further details), with its subsequent 
development to undertake regression-type analy- 
ses then exposited (using RCaRBS). To aid in the 
clarity of the presentation, the given description 
will be undertaken using terminology, where ap- 
propriate, associated with the regression models 
conventionally used in strategic management 
research (see Meilich, 2006), whereby objects 
(in this case survey respondents - see later) are 
associated with a dependent variable (e.g. stra- 
tegic consensus) and described by a number of 
independent variables (e.g. respondents’ perceived 
organizational characteristics). 

Within CaRBS, the information from a respondent’s perception of an
organization’s characteristic on a specific issue (see later), termed
here a characteristic value, is quantified in a BOE, generally denoted
by $m(\cdot)$, where all assigned mass values sum to unity and there is
no belief in the empty set (as stated earlier in the technical
description of DST). Moreover, for a respondent $R_j$
($1 \le j \le n_R$) and the $i$th characteristic $C_i$
($1 \le i \le n_C$) describing it, a characteristic BOE, defined
$m_{j,i}(\cdot)$, is made up of the mass values $m_{j,i}(\{x\})$ and
$m_{j,i}(\{\neg x\})$, which denote levels of exact belief in the
association of the object to a hypothesis $x$ (strategic consensus) and
not-the-hypothesis $\neg x$ (strategic not-consensus), and
$m_{j,i}(\{x, \neg x\})$, the level of concomitant ignorance. In the
case of $m_{j,i}(\{x, \neg x\})$, its association with the term
ignorance is because this mass value is unable to be assigned
specifically to either $x$ or $\neg x$.

The characteristic BOE represents the evidence from one of a respondent’s characteristic values (responses). From Safranek et al. (1990), used in CaRBS, the mass values in a characteristic BOE are given by the expressions (for a characteristic value $v$):

$$m_{j,i}(\{x\}) = \frac{B_i}{1 - A_i}\, cf_i(v) - \frac{A_i B_i}{1 - A_i},$$

$$m_{j,i}(\{\neg x\}) = \frac{-B_i}{1 - A_i}\, cf_i(v) + B_i,$$

and $m_{j,i}(\{x, \neg x\}) = 1 - m_{j,i}(\{x\}) - m_{j,i}(\{\neg x\})$, where $cf_i(v) = 1/(1 + \exp(-k_i(v - \theta_i)))$ (a sigmoid function similar to that used in neural networks - see later), and $k_i$, $\theta_i$, $A_i$ and $B_i$ are control variables incumbent in CaRBS (for its configuration). Importantly, if either $m_{j,i}(\{x\})$ or $m_{j,i}(\{\neg x\})$ is negative it is set to zero, and the respective $m_{j,i}(\{x, \neg x\})$ then calculated.
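The construction of a characteristic BOE described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function and argument names are assumptions, and the input values are those quoted in the chapter's later worked example for characteristic C1 (v = 1.795, k = -2.869, θ = 0.760, A = 0.252, B = 0.596).

```python
import math


def confidence_factor(v, k, theta):
    """Sigmoid confidence value cf(v) = 1 / (1 + exp(-k * (v - theta)))."""
    return 1.0 / (1.0 + math.exp(-k * (v - theta)))


def characteristic_boe(v, k, theta, a, b):
    """Return (m_x, m_not_x, m_ignorance) for one characteristic value v.

    Names are illustrative; a and b stand for the control variables A and B.
    """
    cf = confidence_factor(v, k, theta)
    m_x = (b / (1.0 - a)) * cf - (a * b) / (1.0 - a)
    m_not_x = (-b / (1.0 - a)) * cf + b
    # Negative mass values are set to zero before the ignorance is calculated.
    m_x = max(m_x, 0.0)
    m_not_x = max(m_not_x, 0.0)
    return m_x, m_not_x, 1.0 - m_x - m_not_x


# Values taken from the chapter's worked example for characteristic C1.
boe = characteristic_boe(v=1.795, k=-2.869, theta=0.760, a=0.252, b=0.596)
print(tuple(round(m, 3) for m in boe))
```

With these inputs the three mass values come out at approximately 0.000, 0.557 and 0.443, in line with the worked example; the small difference in the intermediate exponential term arises from rounding of the reported control variables.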

Figure 1 presents, with respect to the CaRBS technique, a graphical presentation of the process from a characteristic value v to a characteristic BOE, and its subsequent representation as a single simplex coordinate in a simplex plot (and then its “regression” to a single predicted strategic consensus value as part of the RCaRBS development - discussed later).

In Figure 1, one of a respondent’s characteristic values $v$ is first transformed into a confidence value (1a), from which it is de-constructed into its associated characteristic BOE (1b), made up of the triplet of mass values, $m_{j,i}(\{x\})$, $m_{j,i}(\{\neg x\})$ and $m_{j,i}(\{x, \neg x\})$, using the expressions given previously. Stage (1c) then shows that a characteristic BOE $m_{j,i}(\cdot)$, with $m_{j,i}(\{x\}) = v_{j,i,1}$, $m_{j,i}(\{\neg x\}) = v_{j,i,2}$ and $m_{j,i}(\{x, \neg x\}) = v_{j,i,3}$, can be represented as a simplex coordinate ($p_{j,i,v}$) in a simplex plot (equilateral triangle). That is, a point $p_{j,i,v}$ exists within an equilateral triangle such that the least distances from $p_{j,i,v}$ to each of the sides of the equilateral triangle are in the same proportions (ratios) to the values $v_{j,i,1}$, $v_{j,i,2}$ and $v_{j,i,3}$ (see for example, Canongia Lopes, 2004). In the case of a simplex plot with unit side, with vertices $(0, 0)$, $(1, 0)$ and $(0.5, \sqrt{3}/2)$, the $p_{j,i,v}$ simplex coordinate $(x_{j,i}, y_{j,i})$ is given by $x_{j,i} = v_{j,i,1} + 0.5\, v_{j,i,3}$ and $y_{j,i} = (\sqrt{3}/2)\, v_{j,i,3}$.

The set of characteristic BOEs $\{m_{j,i}(\cdot),\ i = 1, \ldots, n_C\}$, associated with a respondent $R_j$, found from its characteristic values, can be combined using Dempster’s combination rule into a respondent BOE, defined $m_j(\cdot)$. Moreover, considering $m_{j,i_1}(\cdot)$ and $m_{j,i_2}(\cdot)$ as two independent characteristic BOEs, $[m_{j,i_1} \oplus m_{j,i_2}](\cdot)$ defines their combination (on a single focal element), and is given here by (in terms of a newly created BOE made up of three mass values): (see Box 1)

Box 1.

$$[m_{j,i_1} \oplus m_{j,i_2}](\{x\}) = \frac{m_{j,i_1}(\{x\})\, m_{j,i_2}(\{x\}) + m_{j,i_2}(\{x\})\, m_{j,i_1}(\{x, \neg x\}) + m_{j,i_1}(\{x\})\, m_{j,i_2}(\{x, \neg x\})}{1 - \left(m_{j,i_1}(\{\neg x\})\, m_{j,i_2}(\{x\}) + m_{j,i_1}(\{x\})\, m_{j,i_2}(\{\neg x\})\right)},$$

$$[m_{j,i_1} \oplus m_{j,i_2}](\{\neg x\}) = \frac{m_{j,i_1}(\{\neg x\})\, m_{j,i_2}(\{\neg x\}) + m_{j,i_2}(\{x, \neg x\})\, m_{j,i_1}(\{\neg x\}) + m_{j,i_2}(\{\neg x\})\, m_{j,i_1}(\{x, \neg x\})}{1 - \left(m_{j,i_1}(\{\neg x\})\, m_{j,i_2}(\{x\}) + m_{j,i_1}(\{x\})\, m_{j,i_2}(\{\neg x\})\right)},$$

$$[m_{j,i_1} \oplus m_{j,i_2}](\{x, \neg x\}) = 1 - [m_{j,i_1} \oplus m_{j,i_2}](\{x\}) - [m_{j,i_1} \oplus m_{j,i_2}](\{\neg x\}).$$

The ability to explicitly write out the combination of two characteristic BOEs (rather than the original combination rule), is due to a binary frame of discernment being considered (the hypotheses $x$ and $\neg x$ only). This process is then used iteratively to combine all the characteristic BOEs describing the evidence in a respondent’s characteristic values, into its associated respondent BOE. In the original CaRBS technique, the respondent BOE contained the evidence that described a respondent’s association to the considered hypothesis, not-the-hypothesis and concomitant ignorance (viewed here as binary classification with ignorance).

To illustrate this method of combination (illustration given because of its novelty), the two example BOEs, $m_1(\cdot)$ and $m_2(\cdot)$, shown in Figure 1c, are considered. Their combination to a BOE, denoted $m_C(\cdot)$ ($= [m_1 \oplus m_2](\cdot)$), is evaluated to be $m_C(\{x\}) = 0.467$, $m_C(\{\neg x\}) = 0.224$ and $m_C(\{x, \neg x\}) = 0.309$. This combination process is graphically shown within the simplex coordinate representation of the combined BOE $m_C(\cdot)$ presented in Figure 1c (with evaluated simplex coordinate (0.622, 0.268)).
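The combination rule for a binary frame of discernment, and the mapping of a BOE to a simplex coordinate, can be sketched as follows. This is an illustrative sketch only: the function names are assumptions, and the two input BOEs are assumed values (the masses shown in Figure 1c are not reproduced in the text), chosen so that the output agrees with the combined BOE and simplex coordinate quoted above.

```python
import math
from functools import reduce


def combine(m1, m2):
    """Dempster's rule for two BOEs (m_x, m_not_x, m_ignorance) on {x, not-x}."""
    x1, n1, i1 = m1
    x2, n2, i2 = m2
    # Mass assigned to conflicting pairs; the rule is undefined if this equals 1.
    conflict = n1 * x2 + x1 * n2
    norm = 1.0 - conflict
    m_x = (x1 * x2 + x2 * i1 + x1 * i2) / norm
    m_not_x = (n1 * n2 + i2 * n1 + n2 * i1) / norm
    return m_x, m_not_x, 1.0 - m_x - m_not_x


def combine_all(boes):
    """Iteratively combine characteristic BOEs into one respondent BOE."""
    return reduce(combine, boes)


def simplex_coordinate(boe):
    """Coordinate in a unit-side triangle with {not-x} at (0, 0) and {x} at (1, 0)."""
    m_x, _, m_ign = boe
    return m_x + 0.5 * m_ign, (math.sqrt(3.0) / 2.0) * m_ign


# Assumed example inputs (hypothetical; not given in the text).
m_1 = (0.564, 0.000, 0.436)
m_2 = (0.052, 0.398, 0.550)
m_c = combine_all([m_1, m_2])
x_c, y_c = simplex_coordinate(m_c)
print([round(m, 3) for m in m_c], round(x_c, 3), round(y_c, 3))
```

Because Dempster's rule is associative, folding the pairwise rule over a list gives the same respondent BOE regardless of the order in which the characteristic BOEs are combined.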


The relative position of $m_C(\cdot)$ to the simplex coordinates of $m_1(\cdot)$ and $m_2(\cdot)$ shows it is nearer the base line of the equilateral triangle (furthest away from the $\{x, \neg x\}$ vertex of the presented BOEs), so has less associated ignorance than each of the pieces of evidence that combined to create it (as is the case). Further, the horizontal position of $m_C(\cdot)$, nearer to the $\{x\}$ vertex than the $\{\neg x\}$ vertex, indicates the evidence in $m_C(\cdot)$ supports more the association to $x$ than $\neg x$.

The CaRBS technique is governed by the values assigned to the incumbent control variables $k_i$, $\theta_i$, $A_i$ and $B_i$, evaluated through a configuration process. These control variables contribute directly to the construction of the characteristic BOEs $m_{j,i}(\cdot)$, which are combined to produce the respective respondent BOEs $m_j(\cdot)$. A CaRBS configuration is considered a constrained optimisation problem (see later), able to be solved using an evolutionary algorithm such as Trigonometric Differential Evolution (TDE - Storn & Price 1997, Fan & Lampinen 2003). In summary, TDE is an evolutionary algorithm that iteratively generates improved solutions to an optimization problem by making marginal changes to previous solutions using the differences between pairs of other previous solutions.

Figure 1. Stages in CaRBS for a single characteristic value v to formulate a characteristic BOE and its representation in a simplex plot

The effectiveness of a configured CaRBS system is measured by a defined objective function (used in the TDE process), whether for classification-type or regression-type analyses. With CaRBS (classification-type analyses), an objective function, defined OBC, uses the equivalence classes, $E(x)$ and $E(\neg x)$, the groups of respondents known to be associated with $x$ and $\neg x$, respectively. The optimum solution, based on the respondent BOEs $m_j(\cdot)$ here, is to maximise the difference values $(m_j(\{x\}) - m_j(\{\neg x\}))$ and $(m_j(\{\neg x\}) - m_j(\{x\}))$, depending on whether the considered respondent $R_j$ is associated with $x$ (in $E(x)$) or $\neg x$ (in $E(\neg x)$), respectively. Where optimisation is minimisation with lower limit zero, the OBC is given by (see Beynon, 2005b): (see Box 2)

Also demonstrated in Figure 1c is the development of the CaRBS technique that allows it to be considered a tool for regression-type analysis, the proposed RCaRBS derivative of the original CaRBS technique. Continuing the example above, the BOE $m_C(\cdot)$ (potentially representing a respondent BOE), includes the evidential information to calculate the associated predicted value over the domain ranging from $\neg x$ to $x$ (as would be found from a regression analysis), where each respondent $R_j$ has an actual known value in this domain. Returning to Figure 1c, this predicted value is found by projecting the associated simplex coordinate for $m_C(\cdot)$ onto the base line of the simplex plot (projected using a line from the $\{x, \neg x\}$ vertex through the simplex coordinate of $m_C(\cdot)$). Representing the simplex coordinate of $m_C(\cdot)$ as $(x_C, y_C)$, and considering an equilateral triangle of unit side (as previously), the projected value is given by $(\sqrt{3}\, x_C - y_C)/(\sqrt{3} - 2 y_C)$, with a domain 0 to 1.

The projected value evaluated for each respondent, found this way, is considered their respective predicted value, defined $R_j^p$ (in keeping with the use of the equilateral triangle with unit side in RCaRBS, the original strategic consensus values (see later) are a priori formatted into the same 0 to 1 domain - through normalization, see Kim 1999). For the example considered here, using $m_C(\cdot)$, with $x_C = 0.622$ and $y_C = 0.268$ found previously, the projected value from $m_C(\cdot)$ is 0.6758 (see Figure 1c). One feature of this projection is that the evaluated predicted value is devoid of an associated ignorance value (existing in the associated respondent BOE). Importantly also, the roles played by $\{x\}$ and $\{\neg x\}$ are different to that in the original CaRBS (hypothesis and not-the-hypothesis); now they are associated with the limits on some variable term (such as a continuous strategic consensus value - see later).

Box 2.

$$\mathrm{OBC} = \frac{1}{4}\left[\frac{1}{|E(x)|}\sum_{R_j \in E(x)} \left(1 - m_j(\{x\}) + m_j(\{\neg x\})\right) + \frac{1}{|E(\neg x)|}\sum_{R_j \in E(\neg x)} \left(1 + m_j(\{x\}) - m_j(\{\neg x\})\right)\right],$$

in the limit, $0 \le \mathrm{OBC} \le 1$. It is noted, maximising a difference value such as $(m_j(\{x\}) - m_j(\{\neg x\}))$ minimises classification ambiguity but only indirectly the associated ignorance.
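The OBC objective function of Box 2 can be sketched as below. This is a hedged illustration: the function name and the tuple layout `(m_x, m_not_x, m_ignorance)` are assumptions, and the two respondent BOEs in the usage lines are the example values reported later in the chapter's CaRBS analysis, used here only to show the arithmetic.

```python
def ob_c(boes_x, boes_not_x):
    """OBC for respondent BOEs grouped by known class.

    boes_x:     BOEs of respondents known to be associated with x.
    boes_not_x: BOEs of respondents known to be associated with not-x.
    Each BOE is a tuple (m_x, m_not_x, m_ignorance); both groups must be non-empty.
    """
    term_x = sum(1.0 - m_x + m_nx for m_x, m_nx, _ in boes_x) / len(boes_x)
    term_not_x = sum(1.0 + m_x - m_nx for m_x, m_nx, _ in boes_not_x) / len(boes_not_x)
    return 0.25 * (term_x + term_not_x)


# One respondent in each class, using respondent BOEs quoted later in the chapter.
value = ob_c([(0.684, 0.259, 0.057)], [(0.189, 0.733, 0.079)])
print(round(value, 4))
```

A value of 0 corresponds to every respondent BOE placing all its mass on the correct class, 1 to all mass on the wrong class, and 0.5 to total ignorance throughout.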
As with the original CaRBS, the required configuration of a RCaRBS model depends on the assignment of values to the incumbent control variables ($k_i$, $\theta_i$, $A_i$ and $B_i$, $i = 1, \ldots, n_C$). In RCaRBS, this configuration is defined by minimizing the error between the respective actual and predicted strategic consensus values (through its objective function - defined OBR). The specific measure (OBR) employed will focus on using the well known sum of squares error term, see Radhakrishnan and Nandan (2005). With the respondents’ actual strategic consensus values $R_j^v$ ($j = 1, \ldots, n_R$) and respective predicted values $R_j^p$ from a RCaRBS configured model, the fit is measured by

$$\mathrm{OBR} = \sum_{j=1}^{n_R} \left(R_j^v - R_j^p\right)^2.$$
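The projection that yields a predicted value, and the sum of squares objective OBR, can be sketched together. The function names and the small lists of actual and predicted values are illustrative assumptions; only the example BOE (0.467, 0.224, 0.309) comes from the text.

```python
import math


def predicted_value(boe):
    """Project a BOE's simplex coordinate onto the base line (domain 0 to 1).

    The projection is undefined for total ignorance (m_ignorance = 1),
    where the coordinate coincides with the {x, not-x} vertex.
    """
    m_x, _, m_ign = boe
    x_c = m_x + 0.5 * m_ign
    y_c = (math.sqrt(3.0) / 2.0) * m_ign
    return (math.sqrt(3.0) * x_c - y_c) / (math.sqrt(3.0) - 2.0 * y_c)


def ob_r(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


pred = predicted_value((0.467, 0.224, 0.309))
error = ob_r([0.2, 0.8], [0.3, 0.6])  # hypothetical normalised values
print(round(pred, 4), round(error, 4))
```

Substituting the simplex coordinate expressions shows that the projected value simplifies to m({x}) / (m({x}) + m({¬x})), which is why the prediction carries no ignorance term.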


With closed domains also defined on the 
individual control variables present in RCaRBS, 
the minimization of OBR similarly becomes a 
constrained optimization problem, solved here 
again using the evolutionary computation algo- 
rithm TDE. The necessary operating parameters 
used throughout this chapter with TDE, were 
(ibid.): amplification control F = 0.99, crossover 
constant CR = 0.85 and number of parameter 
vectors NP = 200. 
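To make the role of the operating parameters F, CR and NP concrete, the following sketch implements the basic differential evolution scheme of Storn and Price with bound clipping. It is deliberately not TDE: the trigonometric mutation of Fan and Lampinen is omitted, and the function name, the clipping of trial values to the bounds, the small population and the sphere test objective are all assumptions made for illustration.

```python
import random


def differential_evolution(objective, bounds, np_size=20, f=0.99, cr=0.85,
                           generations=100, seed=0):
    """Basic DE/rand/1/bin minimiser over box-constrained variables."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(np_size)]
    cost = [objective(p) for p in pop]
    initial_best = min(cost)
    for _ in range(generations):
        for i in range(np_size):
            # Three distinct population members, all different from member i.
            a, b, c = rng.sample([j for j in range(np_size) if j != i], 3)
            forced = rng.randrange(dim)  # ensures at least one mutated variable
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == forced:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # keep within closed domain
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_cost = objective(trial)
            if trial_cost <= cost[i]:  # greedy selection
                pop[i], cost[i] = trial, trial_cost
    best = min(range(np_size), key=cost.__getitem__)
    return pop[best], cost[best], initial_best


bounds = [(-3.0, 3.0), (-2.0, 2.0)]
solution, best_cost, start_cost = differential_evolution(
    lambda p: sum(x * x for x in p), bounds)
print(solution, best_cost)
```

In a CaRBS or RCaRBS configuration, the objective would be OBC or OBR evaluated from the control variables in each parameter vector, with the bounds given by the closed domains on those variables.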


MAIN THRUST 


This section outlines the main thrust of this chap- 
ter, namely the exposition of the CaRBS (and 
RCaRBS) technique in the data mining analysis 
of an organization research problem, in this case 
the antecedents of intra-organizational consensus 
on strategic priorities. 

Strategic consensus is a key issue within the 
literature on strategic management (see Keller- 
manns et al., 2005). Intra-organizational agree- 
ment on strategic priorities is a vital resource for 
senior managers seeking to reap the benefits of 
cooperation and coordination for organizational 
performance (Bourgeois, 1980, 1985; Dess, 1987; 


Homburg et al., 1999). The benefits of consensus 
may be especially significant for government or- 
ganizations, since they are often required to meet 
multiple and often conflicting goals that place 
great demands on the need for close collaborative 
working relationships (Moore, 1995). Indeed, theory and evidence have grown on the role of intra-organizational behaviour and strategic planning and management in public organizations (see for example, Boyne and Walker, 2004; Bryson, 2004). However, although much has been written about the hypothesised benefits of intra-organizational consensus in the private sector, little is yet known about its antecedents or how it might best be analysed in either the public or private sectors. In particular, few researchers have systematically evaluated the correlates of “vertical” consensus across different managerial levels, rather than “horizontal” consensus within top management teams, and to date none have drawn upon uncertain modelling techniques.

Principal agent models of managerial decision- 
making indicate that top management is likely 
to seek agents who share the same values and 
priorities. In other words, senior managers in 
public organizations will seek to establish align- 
ment on strategic priorities across the organization 
to minimize the potential for shirking amongst 
middle managers. To achieve consensus on 
strategic priorities, the most effective manage- 
rial choices might be to centralize the decision- 
making of the organization, formalize the role and 
responsibilities of staff and introduce systematic 
planning processes thereby obviating the need 
to expend significant time and resources to gain 
a high level of alignment. These choices may be 
even more important when organizations confront 
a high degree of environmental uncertainty, as 
they can reduce the transaction costs associated 
with generating a coordinated response to difficult 
operating conditions. Indeed, contingency theory 
also suggests that the degree of consensus on 
strategic priorities, present within organizations, 
is likely to be positively related to centralization 
and formal planning processes, while perceived environmental uncertainty is associated with dissensus on strategic priorities (Dess and Origer, 1987).

Evidence on this issue is so far sparse and has 
been restricted to studies of “horizontal” consensus 
amongst top management teams within private 
firms, principally in North America (e.g. Priem, 
1990). By applying an uncertain modelling ap- 
proach, such as CaRBS and RCaRBS, to the issue 
of “vertical” consensus between senior and middle 
managers within the public sector, we therefore 
seek to introduce a novel technique for data min- 
ing within organizational research, and provide 
initial exploratory findings on an important but 
under-researched topic within the field. 


Consensus Data Set 


The consensus data set is drawn from a ques- 
tionnaire survey gauging managers’ views on 
strategic management within a large urban local 
government in Wales. Welsh local governments are 
governed by elected bodies with a Westminster- 
style cabinet system of political management, in 
which the cabinet represents the de facto executive 
branch of government, and is usually made up of 
senior members of the ruling political party. They 
operate in specific geographical areas, employ 
professional career staff, and receive approxi- 
mately two-thirds of their income from the central 
government. The local government analysed here 
is a multipurpose authority providing education, 
social care, environmental services (such as land 
use planning and waste management), housing, 
and leisure and cultural services. By focusing on 
a single local government, we are able to draw 
on a more comprehensive coverage of managers 
throughout an organization than would be possible 
using a sample of several governments. In doing 
so, we follow Meyer et al.’s (1993) argument that researchers should pay more serious attention to the micro-level determinants of intra-organizational behaviour.
Data on strategic management in our sample organization are derived from an electronic survey of top and middle managers within the local government conducted in autumn 2002. The survey explored informants’ perceptions of organization management and performance (a copy of the questionnaire is available on request from the authors). Survey respondents were asked a series of questions assessing strategy, structure, process and environment in the organization. For each question, informants placed the organization on a seven-point Likert scale ranging from 1 (strongly disagree with the statement) to 7 (strongly agree with the statement). The sampling frame consisted of 82 informants within five major service departments and the top management team of the organization. Responses were received from 49 per cent of individual informants (40: 2 members of the top management team and 38 middle managers).

In this study, the priority accorded to Miles 
and Snow’s (1978) defending strategy is inves- 
tigated. Empirical studies have indicated that 
defending is often a successful strategy in the 
public sector (e.g. Andrews et al., 2009; Meier 
et al., 2007). Defending organizations typically 
take a conservative view of innovation, focusing 
on service quality and devoting ‘primary atten- 
tion to improving the efficiency of their existing 
operations’ (Miles and Snow 1978, p. 29). To explore the extent to which our study organization displayed defender characteristics, informants were asked three questions: “we seek to maintain stable service priorities”; “the service emphasizes efficiency of provision”; and “we focus on our core activities”. These questions were all based on prior work (Snow and Hrebiniak, 1980; Miller, 1986). To capture the complex multi-dimensional nature of strategic management, a single defending index was then created for the purposes of our analysis (from the three questions).

To gauge the relative degree of consensus on a strategy of defending within the study organization we take the absolute value of the distance between each middle manager’s perceptions of defending within the organization from that held by the two top managers. The resulting consensus measure for the middle managers was subtracted from 10 to ensure that a higher score indicated greater consensus with the priority accorded to defending by top management. Even though the decision variable of consensus on a defending strategy is continuous in nature, here, it is also considered as a binary variable, through the discretisation of the variable. This discretisation is created by assigning the value of 0 where a respondent’s strategy value is below the organizational mean and 1 where it is above the mean. Descriptive statistics for middle managers’ perceptions of defending and their score on the measure of defending consensus are shown in Table 2.

Table 2. Descriptive statistics for defending and consensus on defending

Description                                     | Mean   | Std dev
We seek to maintain stable service priorities   | 2.9211 | 1.1942
The service emphasizes efficiency of provision  | 3.0263 | 1.5680
Consensus on defending                          | 8.8435 | 0.7392
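The consensus measure and its discretisation can be sketched as follows. The function names and the example scores are hypothetical. Two details are assumptions not stated in the text: the top managers' view is passed in as a single index value, and a score exactly equal to the mean is coded 0.

```python
def consensus_scores(middle_manager_index, top_manager_index):
    """Consensus = 10 - |middle manager's defending index - top managers' index|."""
    return [10.0 - abs(value - top_manager_index) for value in middle_manager_index]


def discretise(scores):
    """Code 1 where a score is above the mean, 0 otherwise (ties coded 0 by assumption)."""
    mean = sum(scores) / len(scores)
    return [1 if score > mean else 0 for score in scores]


# Hypothetical defending index values for four middle managers.
scores = consensus_scores([3.0, 4.5, 2.0, 3.5], top_manager_index=3.5)
labels = discretise(scores)
print(scores, labels)
```

The continuous scores would feed the RCaRBS analysis (after normalisation to the 0 to 1 domain), while the binary labels define the equivalence classes used by CaRBS.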

In this study, eight organization characteristics are hypothesised to influence the relative degree of consensus between middle and top management on a defending strategy, described here as respondents’ (middle managers’) perceived organization characteristics. To control for the possibility that
strategic priorities varied by service department 
within the local government, four dichotomous 
variables were included, coded 1 for each of the 
service departments in which respondents worked 
(education, social services, environment and 
leisure) and 0 otherwise (with housing the omit- 
ted category). 

Centralized decision-making is arguably asso- 
ciated with higher levels of consensus. The relative 
degree of centralization within the organization 


was gauged by asking middle manager respondents 
whether: “strategy for our service is usually made 
by the head of service”. Similarly, highly formal- 
ized job specifications are likely to tighten the 
link between the priorities of senior and middle 
managers. By contrast, if middle managers enjoy 
significant levels of job autonomy it is conceivable
that consensus is more difficult to achieve. This 
was evaluated by simply asking middle manager 
respondents if they experienced: “a great deal of 
autonomy”. Systematic step-by-step procedures 
for the formulation of strategic decisions can re- 
duce the potential for divergent views to emerge, 
thereby bolstering levels of intra-organizational 
consensus. To assess the presence of rational plan- 
ning processes, the following question was posed 
to middle managers: “targets in the service are 
matched to specifically identified citizen needs”. 
Finally, to assess the extent to which perceived 
environmental uncertainty influenced consensus, 
middle manager respondents were asked if: “the 
socio-economic context is unpredictable”. Table 
3 presents a brief description of the respondent 
(middle manager) perceived organization char- 
acteristics and concomitant descriptive statistics 
(used later in the CaRBS and RCaRBS analyses). 


CaRBS Analysis 


This section undertakes a CaRBS analysis of the 
previously described strategic consensus data 
set. The CaRBS analysis is to undertake binary 
classification optimisation using the defined ob- 
jective function OBC (previously defined). The 
utilisation of the objective function OBC in the 
configuration of the CaRBS system is to directly 
minimize the level of ambiguity present in the 
classification of managers (respondents) within 
the organization to their association to strategic 
consensus and not-consensus, but not the con- 
comitant ignorance (binary classification with 
ignorance using CaRBS). 

To configure a CaRBS system through the minimization of the respective OBC, the respondent perceived organization characteristic values were standardized (using the descriptive statistics in Table 3), prior to the employment of TDE (see later), allowing consistent domains over the control variables incumbent in CaRBS, set as: $-3 \le k_i \le 3$, $-2 \le \theta_i \le 2$, $0 \le A_i < 1$ and $B_i \le 0.6$ (see Beynon, 2005b). The upper bound on the $B_i$ control variables ensured a predominance of ignorance in the evidence from individual characteristic values (in the concomitant characteristic BOEs), so reduced over-conflict during the combination of the pieces of evidence (combination of characteristic BOEs).

Table 3. Description of respondents’ (middle manager) perceived organization characteristics and concomitant descriptive statistics

Table 4. Control variables values associated with respondent perceived organization characteristics using OBC in configuration of CaRBS system

The TDE method was employed, based on the previously defined TDE-based parameters, and run five times, each time converging to an optimum value, the best out of the five runs being OBC = 0.304. A reason for this value being away from its lower bound of zero is related to the implicit minimum levels of ignorance associated with each characteristic BOE (fixing of the upper bounds of the $B_i$ control variables), possibly also due to the presence of conflicting evidence from the characteristics. The resultant CaRBS associated control variables found from the best TDE run are reported in Table 4.

A brief inspection of these results shows the near uniformity in the $k_i$ control variables, with the majority of the absolute values near the limits of 3.000 (positive and negative), the exception being with C1. This exhibits the attempt to offer most discernment between the hypothesis (strategy consensus) and its complement (strategy not-consensus), in the evidence from the respondent perceived organization characteristics (see Figure 1 and definition of confidence factor $cf_i(\cdot)$). The role of these defined control variable values is to allow the construction of characteristic BOEs and their subsequent combination to formulate a series of respondent BOEs for the 38 respondents considered.
Table 5. Characteristic values and characteristic BOEs for the respondents, R, and R,,, using OBC in configuration of CaRBS system


The construction of a characteristic BOE is next demonstrated, considering the respondent R, and the organization characteristic C1. Starting with the evaluation of the confidence factor $cf_{C1}(\cdot)$ (see Figure 1a), for the respondent R,, C1 = 1.000, which when standardised is $v = 1.795$ (see Table 5), then:

$$cf_{C1}(1.795) = \frac{1}{1 + e^{2.869(1.795 - 0.760)}} = \frac{1}{1 + 19.505} = 0.049,$$

using the control variables in Table 4. This confidence value is used in the expressions making up the mass values in the characteristic BOE $m_{j,C1}(\cdot)$ (with $j$ indexing this respondent), namely $m_{j,C1}(\{C\})$, $m_{j,C1}(\{\neg C\})$ and $m_{j,C1}(\{C, \neg C\})$, found to be:

$$m_{j,C1}(\{C\}) = \frac{0.596}{1 - 0.252} \times 0.049 - \frac{0.252 \times 0.596}{1 - 0.252} = 0.039 - 0.201 = -0.162 < 0.000, \text{ so } = 0.000,$$

$$m_{j,C1}(\{\neg C\}) = \frac{-0.596}{1 - 0.252} \times 0.049 + 0.596 = -0.039 + 0.596 = 0.557,$$

$$m_{j,C1}(\{C, \neg C\}) = 1 - 0.000 - 0.557 = 0.443.$$

For the respondent R,, this characteristic BOE 
is representative of all the associated character- 
istic BOEs m, (-), presented in Table 5 (using 
standardised characteristic values), along with 
those for the respondent R, . These characteristic 
BOEs describe the evidential support from all 
the perceived organization characteristic values, 
associated with a respondent, to the overall intra- 
organizational strategic consensus or strategic 
not-consensus classification (R, and R,,, are 
known to exhibit consensus and not-consensus 
with the top management team’s perspective on 
defending, respectively). 

In Table 5, for the evidence from the charac- 
teristics to support correct classification of the 
respondent R, in this case to strategic consensus 
({C}), it would be expected for the m ({C}) 
mass values to be larger than their respective 
m ({7C}) mass values, which is the case for the 
characteristics, C5, C6 and C7. Whereas, C1 and 
C8, offer more evidence towards the respondent 
having not-consensus, and C2, C3 and C4 only 
total ignorance. The predominance of characteristic BOEs supporting correct classification (of those giving evidence), is reflected in the final respondent BOE m,(·) produced (through the combination of all the characteristic BOEs), which has mass values m,({C}) = 0.684, m,({¬C}) = 0.259 and m,({C, ¬C}) = 0.057. This respondent BOE, with m,({C}) = 0.684 > 0.259 = m,({¬C}), suggests the respondent R, is more associated with strategic consensus, which is the correct classification in this case.

Figure 2. Simplex coordinates of characteristic and respondent BOEs for R, and R,, using OBC in configuration of CaRBS system

For the respondent R,,, the evidence from the 
characteristics is more towards their strategic 
not-consensus (in particular C2, C7 and C8). The 
combination of the concomitant characteristic 
BOEs produces a respondent BOE m,,(-), with 
m,({C}) = 0.189, m,({¬C}) = 0.733 and m,({C, ¬C}) = 0.079, which indicates majority association to strategic not-consensus ({¬C}), which is
correct in this case. For further interpretation of 
the characteristic and respondent BOEs associated 
with the respondents, R, and R, ,, their representa- 
tions as simplex coordinates in a simplex plot are 
reported in Figure 2. 

Figures 2a and 2b offer a visual representation of the evidence from the eight perceived organization characteristics to the classification of the respondents, R, and R,,, as to whether they approximate strategic consensus (C) or not-consensus (¬C). In each simplex plot the dashed vertical line partitions the regions in a simplex plot where either of the mass values assigned to {¬C} (to the left) and {C} (to the right) is the larger in a BOE. The grey shaded sub-regions show the domains where the characteristic BOEs can exist (due to the bounds on the $B_i$ control variables).

In both presented simplex plots, the simplex coordinates of the final respondent BOEs, m,(·) and m,,(·), are nearer their base lines than those of the associated characteristic BOEs. This is solely
due to the reduction of ignorance from the com- 
bination of evidence present in the characteristic 
BOEs (see also Table 5). The positions of the 
simplex coordinates of the characteristic BOEs 
allow their possibly supporting and conflicting 
support for correct (or incorrect) classification of 
the respondents to be clearly identified (compare 
with discussion of characteristic BOEs associated 
with respondent R, ). 

The process of positioning the classification 
of a respondent, in a simplex plot, on the strategic 
consensus, can be undertaken for each of the 38 
respondents considered, see Figure 3. 

Figures 3a and 3b partition the presentation of the respondents’ respondent BOEs between those known to be more associated with being strategic not-consensus (3a) and consensus (3b), where each respondent BOE is labelled with a circle and cross, respectively. Based on their simplex coordinate respondent BOE positions either side of the vertical dashed lines in the simplex plots in Figure 3, it was found that 13 out of 15 (86.667%) and 17 out of 23 (73.913%) respondents were correctly classified as strategic not-consensus (¬C) and consensus (C), respectively. This combines to a total of 78.947% classification accuracy.

Figure 3. Simplex plot based representation of final respondent BOEs, using OBC in configuration of CaRBS system (panels: a) Not-Consensus; b) Consensus)
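The classification step and the reported accuracies can be checked with a short sketch. The function names are illustrative; the rule for a tie between the two masses is not given in the text and is treated here as not-consensus by assumption. A respondent BOE lies to the right of the dashed vertical line exactly when its mass on {C} exceeds its mass on {¬C}, which is the comparison used below.

```python
def classify(boe):
    """Return "C" if the mass on {C} exceeds that on {not-C}, else "not-C"."""
    m_c, m_not_c, _ = boe
    return "C" if m_c > m_not_c else "not-C"


def accuracy_percent(correct, total):
    """Percentage of correctly classified respondents, to three decimals."""
    return round(100.0 * correct / total, 3)


# The two respondent BOEs quoted in the text.
print(classify((0.684, 0.259, 0.057)), classify((0.189, 0.733, 0.079)))

# Counts reported in the text: 13 of 15 not-consensus, 17 of 23 consensus.
not_consensus = accuracy_percent(13, 15)
consensus = accuracy_percent(17, 23)
overall = accuracy_percent(13 + 17, 15 + 23)
print(not_consensus, consensus, overall)
```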

Consideration of the contribution of the 
individual respondent perceived organization 
characteristics can be graphically gauged from 
combining stages a and b in Figure 1, for the in- 
dividual characteristics C1, ..., C8, see Figure 4. 

In Figure 4, each graph shows the explicit mass values which make up a characteristic BOE, based on the individual characteristic values from respondents. From the description of the respondent
perceived organization characteristics considered, 
see Table 3, C1 to C4 are dummy variables (de- 
noted here by 0 and 1 values - see later), C5 to 
C8 are based on Likert values, here shown over 
the minimum and maximum values shown in the 
strategic consensus data set (of respondents). 

For the characteristic C1 (Education dummy variable), the interpretation is that a respondent value of 0 or 1 offers evidence towards strategic consensus and not-consensus respectively (as well as a level of ignorance). The thin lines show the actual continuous movements of the underlying functional forms of the mass values (see Figure 1). For the characteristics, C2, C3 and C4, the functional forms towards the evaluation of the mass values are the same, with a value 0 offering only ignorance, and a 1 value offering more evidential support to being strategic not-consensus from C2 and C3 and strategic consensus from C4.
These findings suggest that the priority attached 
to defending by middle managers in education, 
social services and environmental services is 
more likely to diverge from the top management 
team than their counterparts in housing and lei- 
sure services. More detailed investigation of the 
management strategies in each service department 
could reveal whether this reflects service-specific 
considerations. 

Figure 4. Contribution graphs for organization characteristic values in terms of their characteristic BOEs, for C1, ..., C8 (using OBC in configuration of CaRBS system)

For the characteristics described by Likert-based valued responses (C5 to C8), the graphs range over the domains of response values given amongst the 38 respondents. For C5, as the value goes from 1 up to 2 there is decreasing evidence towards strategic not-consensus, and for the values 3 up to 7 there is increasing evidence towards strategic consensus (with the concomitant level of ignorance changing accordingly). This confirms our hypothesis that centralization is positively associated with strategic consensus. Similarly, the graphs for C7 and C8 confirm the arguments that
planning is positively related to consensus, but 
that environmental uncertainty has the opposite 
relationship. However, the graph for C6 suggests 
that managerial autonomy may be positively 
rather than negatively related to consensus as 
expected. It is conceivable that middle managers 
in government organizations may have an in-built 
tendency to share top managers’ views on the 
significance of a defending strategy. Investigation
of the relationship between managerial autonomy 
and consensus on alternative strategic priorities 
(e.g. an innovative strategy of prospecting) could 
therefore throw further light on this important 
topic. In summary, the directions of contributions 
of the characteristics, follow the signs of the k, 
values in Table 4. 
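
The link between the sign of k_i and the direction of a contribution graph can be illustrated with a minimal sketch. It assumes the standard CaRBS construction from the wider literature (a sigmoid confidence factor followed by a linear transformation into mass values), which is not restated in this section. The function names and the example parameter values are illustrative assumptions only.

```python
import math


def confidence_factor(v, k, theta):
    """Sigmoid confidence factor cf(v) = 1 / (1 + exp(-k * (v - theta)))."""
    return 1.0 / (1.0 + math.exp(-k * (v - theta)))


def characteristic_boe(v, k, theta, A, B):
    """Return (m_consensus, m_not_consensus, m_ignorance) for one value v.

    Assumed CaRBS-style transformation; requires 0 <= A < 1.
    """
    cf = confidence_factor(v, k, theta)
    m_x = max(0.0, B / (1.0 - A) * cf - A * B / (1.0 - A))
    m_not_x = max(0.0, -B / (1.0 - A) * cf + B)
    return m_x, m_not_x, 1.0 - m_x - m_not_x


# Illustrative values only (not taken from Table 4): a positive k
# shifts mass towards consensus as v grows, a negative k reverses this.
low = characteristic_boe(-1.0, k=2.0, theta=0.0, A=0.3, B=0.6)
high = characteristic_boe(1.0, k=2.0, theta=0.0, A=0.3, B=0.6)
```

Under these assumptions, swapping the sign of k mirrors the graph, which is why the direction of each contribution follows the sign of its k_i value.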
RCaRBS Analysis


This section undertakes an RCaRBS analysis of the strategic consensus data set. Rather than
thinking in terms of the binary classification of respondents to strategic consensus and
not-consensus, the continuous consensus values associated with the respondents are normalised
over the domain 0 to 1, signifying a level of strategic consensus (now from not-consensus (0)
up to consensus (1)). This highlights how CaRBS can be adapted for use with the continuous
variables most commonly found in organizational research.
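
As a small illustration of the normalisation step, the sketch below rescales continuous values onto the domain 0 to 1. The text does not state which normalisation was used, so min-max scaling is an assumption here, and the function name is hypothetical.

```python
def normalise(values):
    """Min-max scale values onto [0, 1] (assumed normalisation method)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are identical; cannot normalise")
    return [(v - lowest) / (highest - lowest) for v in values]


# Illustrative input only: the smallest value maps to 0 (not-consensus)
# and the largest to 1 (consensus).
levels = normalise([2.0, 4.0, 6.0])
```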

The objective function OBR is used in the configuration of the RCaRBS system to minimise
the predictive error (‘sums of squares error’) between the predicted and actual levels of
strategic consensus of the respondents. The same consistent domains over the control
variables incumbent in CaRBS were used, set as -3 ≤ k_i ≤ 3, -2 ≤ θ_i ≤ 2, 0 ≤ A_i < 1
and B_i < 0.6 (see Beynon, 2005b). The TDE method was again employed based on the previously
defined parameters and run five times, each time converging to an optimum value, the best
out of the five runs being OBR = 1.444. The resultant control variable values found from the
best TDE run, using OBR, are reported in Table 6.

Table 6. Control variable values associated with respondent perceived organization characteristics,
using OBR in configuration of RCaRBS system

Figure 5. Simplex coordinates of characteristic and respondent BOEs for two respondents, using OBR in
configuration of RCaRBS system
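
The predictive error described above can be sketched as a plain sum of squared differences. The exact form of OBR (for example any scaling constant) is not given in this section, so the unscaled sum below is an assumption and the function name is hypothetical.

```python
def sum_of_squares_error(predicted, actual):
    """Sum of squared differences between predicted and actual levels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))
```

An optimiser such as TDE would search the control variable domains for the values that make this quantity as small as possible.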

A brief inspection of these results shows a lack of consistency in any of the sets of
parameters across the different characteristics. In the case of the k_i variable values,
this is in contrast to the same absolute values consistently found in the initial CaRBS
analysis. The variations in the other control variable values (again compared with those in
the CaRBS analysis) are most appropriately considered in the resultant characteristic BOEs
found for the respondents.


The characteristic BOEs describe the evidential support from all the organization
characteristics to a respondent’s level of consensus. The two respondents considered here
are known to have 0.683 and 0.137 levels of strategic consensus (normalised values),
respectively. The interpretation of the characteristic and respondent BOEs associated with
these two respondents is undertaken here only through their representation as simplex
coordinates in a simplex plot, and subsequent mapping to single predicted values, see Figure 5.
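
How several characteristic BOEs yield one respondent BOE can be sketched with Dempster's rule of combination on a two-hypothesis frame. This section does not restate the combination rule, so its use here rests on the chapter's stated grounding in Dempster-Shafer theory and is an assumption. The tuple layout (consensus, not-consensus, ignorance) and the function names are hypothetical.

```python
from functools import reduce


def combine(boe1, boe2):
    """Combine two BOEs (m_x, m_not_x, m_ignorance) with Dempster's rule."""
    x1, n1, i1 = boe1
    x2, n2, i2 = boe2
    conflict = x1 * n2 + n1 * x2  # mass given to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("total conflict: the BOEs cannot be combined")
    scale = 1.0 - conflict
    m_x = (x1 * x2 + x1 * i2 + i1 * x2) / scale
    m_not_x = (n1 * n2 + n1 * i2 + i1 * n2) / scale
    m_ign = (i1 * i2) / scale
    return m_x, m_not_x, m_ign


def respondent_boe(characteristic_boes):
    """Fold a non-empty sequence of characteristic BOEs into one BOE."""
    return reduce(combine, characteristic_boes)
```

A characteristic BOE that is pure ignorance, (0, 0, 1), leaves the combined result unchanged, which matches the idea that such a characteristic contributes no evidence.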

Figures 5a and 5b offer a visual representation of the evidence from the characteristics
to the regression-type analysis of the two respondents. In each simplex plot the
contribution of the characteristic BOEs is shown, along with the respective respondent BOE,
and its mapping down to the base line of the simplex plot over the domain 0 to 1 (following
RCaRBS). Also shown below the simplex plots are the actual levels of strategic consensus
associated with each of the two respondents.

Figure 6. Simplex plot based representation of final respondent BOEs, and subsequent mappings, using
OBR in configuration of RCaRBS system: a) Not-Consensus, b) Consensus

The process of positioning the predicted level 
of strategic consensus of a respondent can be 
undertaken for all 38 considered respondents, see 
Figure 6 (respondents partitioned based on having 
actual levels of strategic consensus less than or 
greater than 0.5 - in terms of their standardised 
values). 
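
The positioning of a respondent BOE in a simplex plot, and its mapping down to the base line, can be sketched as follows. The vertex placement (not-consensus at the left, consensus at the right, ignorance at the top) and the vertical projection onto the base are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def simplex_point(boe):
    """Map (m_x, m_not_x, m_ignorance) to 2-D simplex coordinates.

    Assumed vertices: not-consensus (0, 0), consensus (1, 0),
    ignorance (0.5, sqrt(3) / 2).
    """
    m_x, m_not_x, m_ign = boe
    x = m_x * 1.0 + m_not_x * 0.0 + m_ign * 0.5
    y = m_ign * math.sqrt(3) / 2
    return x, y


def predicted_level(boe):
    """Project the simplex point vertically onto the base line [0, 1]."""
    return simplex_point(boe)[0]
```

Under this projection the ignorance mass is shared equally between the two hypotheses, so a respondent BOE with more mass on consensus than on not-consensus lands to the right of 0.5 on the base line.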

In Figure 6, the respondent BOEs of the 38 respondents are mapped to the base of the
simplex plots, giving their predicted level of strategic consensus, and below them their
actual levels of strategic consensus. Inspection of the two simplex plots shows that the
predicted levels of strategic consensus generally lie to the left and to the right in
Figures 6a and 6b, for respondents with actual levels of strategic consensus less than and
greater than 0.5, respectively.

Consideration of the contribution of the individual characteristics can be graphically
gauged from combining stages a and b in Figure 1, for the individual characteristics
C1, ..., C8, see Figure 7.

The graphs in Figure 7 are similar to those presented in Figure 4 (part of the CaRBS
analysis). These graphs, in terms of direction of contribution, follow the signs of the k_i
values in Table 6. In the case of characteristic C2 (Social services dummy variable), the
k_i values found in the CaRBS (-3.000) and RCaRBS (2.238) analyses are of different sign.
However, in the case of the RCaRBS analysis, the low A_i value means particular predominance
of ignorance across its domain of evidence. Despite this minor inconsistency, the findings
for the perceived organizational characteristics are broadly the same as those presented
for the CaRBS analysis.


FUTURE TRENDS 


The future trends associated with inference from the work in this chapter are twofold.
Firstly, there is the potential for the continued technical development of evidence-based
data mining techniques. In particular, the CaRBS and RCaRBS analyses presented here, with
their fundamentals based on the Dempster-Shafer theory, furnish a good example of how such
novel non-parametric approaches may reveal new insights in a host of areas where data
mining is an important consideration.

Figure 7. Contribution graphs for organization characteristic values in terms of their characteristic
BOEs, for C1, ..., C8 (using OBR in configuration of RCaRBS system)

Secondly, there is scope for greater application of CaRBS and related techniques to other
relevant issues in organizational research. To date, studies of organizational behaviour
and outcomes have predominantly drawn on parametric techniques such as multiple regression.
Indeed, even relatively established non-parametric techniques, like neural networks, have
had only limited impact in this field.

This chapter therefore represents an important 
response to the growing clamour for the introduc- 
tion of novel and innovative alternatives to con- 
ventional approaches to data mining and analysis 
within organization and management science. 


CONCLUSION


The acquisition of data-based knowledge is es- 
sential in organizational research; nowhere more 
so than in studies of public and governmental 
organizations. By providing vital information on 
the interrelationships between key organizational 
variables, data mining can enable public policy- 
makers and managers to address the implications 
of organizing within the complex networked set- 
tings in which they increasingly operate. Indeed, 
given heightened governmental interest shown in 
strategic planning in the public sector (see Bryson, 
2004), the preliminary analysis presented here il- 
lustrates how organizational design has important 
implications for strategic management in public 
organizations. 

However, despite calls for more use of unconventional techniques to explore organizational
data across the public and private sectors, as yet the prevalence of Gaussian regression
based forms of analysis has restricted the impact of data mining approaches within the
field of organizational studies. Nevertheless, as the limitations of standard linear
regression analysis become ever more apparent, it is highly likely that a more positive
attitude towards the notion of data mining will emerge amongst organizational researchers.

It is hoped that this chapter offers some 
evidence on the, as yet, untapped potential for 
uncertain modelling of organizational behaviour 
to move the field forward. 
KEY TERMS AND DEFINITIONS


Confidence Factor: A function to transform a value into a standard domain, such as between
0 and 1.

Equivalence Class: Set of objects considered the same subject to an equivalence relation
(e.g. those objects classified to x).

Evolutionary Algorithm: An algorithm that incorporates aspects of natural selection or
survival of the fittest.

Focal Element: A finite non-empty set of hypotheses.

Mass Values: A positive function of the level of exact belief in the associated proposition
(focal element).

Objective Function: A positive function of the difference between predictions and data
estimates that are chosen so as to optimize the function or criterion.

Simplex Plot: Equilateral triangle domain representation of triplets of non-negative values
which sum to one.

Uncertain Modelling: The attempt to represent uncertainty and reason about it when using
uncertain knowledge, imprecise information, etc.
ABSTRACT


In the field of information technology (IT) enabled business networks and research, the traditional data
mining approach is theoretically and practically inadequate for the knowledge elicitation and management
requirements of inter-organizational collaborative business environments. The issues are mostly related to
fundamentally and philosophically narrow conceptions of the meaning of information, and are grounded
in the metatheoretical implications of a positivistic, nomothetic and objective view of reality that restricts
the feasibility of research-oriented applications based on them. Here a novel research framework for
network-wide knowledge discovery is presented that is based on a sociologically anti-positivistic,
ideographic and subjective view of society construed from social facts. The theoretical framework is further
developed here by synthesizing it with, and extracting results from, existing research models and artefacts
originating from the analysis of a variety of business networks (for example, a case study concentrating on
modeling the IT enabled service provision of a local travel industry value chain). The main contribution
here is the explication and elaboration of existing and emerging business network research theories and
related stakeholder-level practical considerations focusing on topics such as: multidisciplinary research
conceptualizations, information asymmetry reduction by benefiting from contract law oriented functional
principles, and network-wide knowledge governance approaches.
INTRODUCTION


Typically, commercial organizations seek optimal ways to use, manage, administer, and share
the critical business information they own. In its static form, information (or knowledge)
can reside in various internal and external data sources. Traditional data mining then
consists of techniques that enable the organization to utilize these data repositories (for
example databases, data marts, or data warehouses) to support timely access to relevant
information (business intelligence) and to extract new knowledge for business purposes.
However, organizational knowledge also has a tacit and dynamic nature, especially in the
context of inter-organizational (customer) relationships, that does not easily comply with
the contemporary methods of knowledge discovery (KD). The same applies to the information
content that many organizations (specifically SMEs) still have only in the form of
unstructured, unorganized, and uncategorized (paper or electronic) business documents.

The intra-organizational and data-centered per- 
spective to business information and data mining 
emerges mostly from objective and positivistic 
assumptions of reality. However, considering the 
nature of inter-organizational communications 
and information flows, it seems evident that there 
is a need for a more subjective, relativistic and 
anti-positivistic view in business and research. 

Also, the multidisciplinary character of scientific work that concentrates on the complex
phenomena of network-oriented knowledge discovery requires the application of appropriate
methodologies and explicit and shared research conceptualizations. At the University of
Lapland, a related example is an on-going two-year PROVEM research project that
concentrates on modeling and analyzing the service provision of a local travel industry
network, which is examined through three partly overlapping research areas: customer value
chains, knowledge management and information modeling, and the legal perspectives of
inter-organizational business relationships and agreements.

In addition to the above intrinsic limitations of the contemporary data mining approach,
it is not very well applicable to tackling information asymmetry, especially in business
networks based on customer-centric service provision. In the world of economy, problems
arise when one party of a business transaction has more information than the other, because
this type of situation has negative implications on business relationships based on trust
and power (Aaltonen, 2007; Gratzer & Winiwarter, 2003). To enable the reduction of
information asymmetry (Turunen, 2005), the mentioned travel industry network research is
in a good position to develop frameworks based on organizational information modeling.
These frameworks can be used to expand the traditional data mining paradigm by analyzing
the characteristics of inter-organizational information flows and, for example, by
specifying the preliminary requirements of the information-intensive governance of business
relationships.

In order to address these complex phenomena, this chapter is organized as follows. The
first section, “Theoretical background”, addresses the theoretical aspects of knowledge
discovery research in the business network context by introducing a novel information
technology discipline and its philosophical base grounded on the idea of socially construed
reality. The main contribution of this work is then presented in the section “Multilevel
model for network-wide knowledge discovery”, which first exposes the logical structure of
the traditional and the here proposed novel KD-process by using the construct of
sociological paradigms. Then, in the section “Multidisciplinary concept analysis of the
domain area”, the key terminology and the conceptual foundations of the main research areas
(i.e. business networks, organizational IT and contract law) in the on-going business
network project are depicted, and finally these constructs are combined in the section
“Preliminary research framework” into an overall multilevel model and a research framework
(a prerequisite for a related research program). The more practically oriented discussion
about the linkage of the developed model to the anti-positivistic characterization of
inter-organizational business information, and about information asymmetry reduction using
functional principles in contract law, is the topic of the section “Network-wide knowledge
discovery and communicative information”, which also contains an analysis of an information
technology enabled travel industry network that is used to extract preliminary business
requirements and to discuss the feasibility of the proposed model at stakeholder level.
Finally, in the section “Future trends and conclusion”, the main themes of this chapter are
projected as prospective future knowledge discovery trends and research enablers, and a
concluding summary is given.


THEORETICAL BACKGROUND 


The theoretical background of this work is based on the Discipline of Information
Technology (DIT), a new science concept within information technology (Kamaja, 2009). It is
used here to show the nature and main characteristics of the required paradigm shift in
research and in business from the traditional organizational data mining to the novel
network-wide knowledge discovery. The construction, substance, and theoretical background
of DIT have been presented against philosophy of science and conceptual theory (Kamaja,
2009), and its frame of reference consists of eight components or levels (Figure 1):
(1) background, (2) technological, (3) scientific-theoretical, (4) philosophical level of
science, (5) metatheoretical, (6) organized activity and management, (7) scientific
program, and (8) theory.

In what follows, the two important metatheoretical sets of DIT in relation to the content
of this research are presented: first, the model of socially construed reality (Searle,
1995), and second, the utilization of sociological paradigms (Burrell & Morgan, 1979)
within philosophy of science in conceptual domain area context analysis.


Socially Construed Reality 
and Social Facts 


As stated above, the first metatheoretical set of DIT
is the construction of social reality (Searle, 1995), 
which is based on the following three ontological 
premises: (i) the existence of one world, (ii) the 
existence of an external world is a prerequisite 
for our thinking, and (iii) the fact that conscious 
states are subjective. A fact is a term expressing 
“the way things are in the world”. The core of 
the construction of social reality is formed by the 
concepts of fact, social fact, and epistemic com- 
munity where the structure of perception is based 
on the psychology of thought (Saariluoma, 1990). 

Facts depending on the observer are ontologically subjective and facts not depending on
the observer are ontologically objective. Ontologically subjective but epistemologically
objective facts, referred to as social facts, thus become the salient categories. A social
fact is a fact containing collective intentionality, which befalls those actors (two or
more) who possess a belief, desire, objective, or some other intentional state and who know
they share this state. According to collective intentionality, instead of thinking that I
am striving for something, an individual thinks that we are striving for something, which
creates a “sense of collectivity.” Consequently, the intentionality possessed by each
individual derives from their shared collective intentionality. Social facts thus stem from
one’s belief that one shares his or her intentionality with someone else; they are
essentially facts possessed by an individual as a member of a community. A classical
example of these is money, which takes on its meaning only if we all believe in its value.
This is basically the nature of all concepts shared by us. Their meaning does not depend on
the observer but they are not included in the world itself either; they are ontologically
subjective, and at the same time these features of our society exceed epistemic
subjectivity (Searle, 1995).

Figure 1. The components (levels) of DIT

Level | Component | Description

1 | Background | Includes the original science concept of information technology (IT) that
is based on the frame of reference of the Association for Computing Machinery and the
Institute of Electrical and Electronics Engineers (ACM/IEEE) and their concept of computing
divided into five computer-related fields of science: Computer Science (CS), Information
Systems (IS), Computer Engineering (CE), Software Engineering (SE), and the original area
of information technology (IT).

2 | Technological | The technological substance of DIT is based on the corresponding
substance of IT complying with ACM/IEEE.

3 | Scientific-theoretical | Includes the creation of a conceptual and theoretical basis
for science, the definition of a theory and theoretical meaning, the specification of the
requirements of critical and advanced science, and the definition of the characteristics of
scientific identity, which requires that its research objects, phenomena, and substance
have individual qualities in comparison to other sciences.

4 | Philosophical level of science | Presents ontological and epistemological choices and
the picture of man. An ontological choice is based on the three-world ontology (Popper,
1972): (1) the physical world, (2) the world of mind, and (3) the world of constructions.
The structure of perception is based on the psychology of thought, which forms the core of
an epistemological choice (Saariluoma, 1990).

5 | Metatheoretical | Covers two sets of models: (i) the first set is based on
philosophical construction of social reality, social facts, and epistemic societies
(Searle, 1995), and (ii) the second set builds on a theory of four sociological paradigms
in relation to organizational analysis (Burrell & Morgan, 1979).

6 | Organized activity and management | The most important theoretical frame of reference
is the multiperspective representation view of organization theory (Hatch & Cunliffe,
2006), which presents three overall conceptions of the (alleged) appearance of the
surrounding world: (i) modern, (ii) symbolic (interpretational), and (iii) postmodern
perspectives. For each perspective, the theory introduces ontological and epistemological
hypotheses, focusing on the organizational viewpoint and organization research.

7 | Scientific program | The core and protective layer of a research program are
constructed, in which the core contains the fundamental assumptions, theoretical concepts,
and theories. The core of DIT consists of a three-world ontology (Popper, 1972) and the
epistemology consists of the structure of perception based on the psychology of thought
and conception of social facts (Searle, 1995). It is structurally surrounded by a
protective layer that includes the theories and concepts gradually generated by the field
of science in question.

8 | Theory | The defined theory of DIT will guide DIT-related research where the focus is
on information and its significance to the operation of an organization.


Sociological Paradigms and 
Conceptual Analysis 


DIT’s second metatheoretical construct consists of conceptual areas and sociological
paradigms within philosophy of science (Burrell & Morgan, 1979). This model encompasses
(i) four concept areas within philosophy of science (ontology, epistemology, human nature,
and methodology) with two opposing concepts forming a pair within each, (ii) two key
dimensions (subjectivity versus objectivity and regulation versus radical change) that form
a four-sector model, (iii) a paradigm from the social sciences connected to each sector
(i.e. radical humanism, radical structuralism, interpretatism, and functionalism), and
(iv) the generally accepted scientific conceptions assigned to each paradigm and agreeing
with their requirements.

Based on the above proposed developments in philosophy of science in relation to conceptual
theory and supported by the multidisciplinary nature of DIT, the two-dimensional framework
of sociological paradigms is employed here as a theoretical construct to organize the
following categories of conceptual domain areas: (A) philosophy of science, (B) scientific
views, (C) business network research, and (D) organizational IT. On the one hand, the
alignment of the individual concepts within these domain areas (Figure 2) is based on
existing philosophical and sociological theories. On the other, it is based on a
preliminary analysis of the research conceptualizations relevant to this chapter.

Figure 2. The four sociological paradigms (Burrell & Morgan, 1979) used to organize contextual
domain areas

Radical change, subjective: Radical humanism
(A): Nominalism, anti-positivism, ideographic, voluntarism
(B.1): Critical theory
(B.2): Postmodernism
(C): Business ecosystems and commercial markets (in turbulent, external consumer driven
competitive environments), SMEs
(D): Internet

Radical change, objective: Radical structuralism
(A): Realism, positivism, nomothetic, determinism
(B.1): Conflict theory
(B.2): -
(C): -
(D): Communities of social media

Regulation, subjective: Interpretatism
(A): Nominalism, anti-positivism, ideographic, voluntarism
(B.1): Hermeneutics, phenomenology
(B.2): Symbolic interpretation
(C): Business networks (supply chains?)
(D): Personal computing, semantic webs, etc.

Regulation, objective: Functionalism
(A): Realism, positivism, nomothetic, determinism
(B.1): Objectivism, integrative theory
(B.2): Modernism
(C): Big hierarchical and bureaucratic organizations (also the "internal stability
features" of private enterprises)
(D): Classical information systems, traditional data mining

As they were originally devised for it, the four sociological paradigms are well suited to
position the various orientations and principles within philosophy of science (A) (Burrell
& Morgan, 1979) in relation to scientific views (B) (Hatch & Cunliffe, 2006). In philosophy
of science, the conceptual areas and opposing pairs are defined as follows (Burrell &
Morgan, 1979): (i) ontology: nominalism - realism, (ii) epistemology: anti-positivism -
positivism, (iii) methodology: ideographic - nomothetic, and (iv) human nature: voluntarism
- determinism. The classical scientific views, such as critical theory, conflict theory,
phenomenology, objectivism, and hermeneutics (B.1), are usually separated by a degree of
commitment to a set of philosophically oriented principles. An organizational
interpretation of these principles is the above described multiperspective view (Hatch &
Cunliffe, 2006) that consists of three mutually exclusive principles (B.2): postmodernism,
symbolic interpretation and modernism, which have been mapped respectively to the paradigms
of radical humanism, interpretatism and functionalism.
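
The four-sector model can also be read as a simple lookup from the two key dimensions to a paradigm. The sketch below is only an illustration of that structure: the dictionary keys, the function name, and the string labels are hypothetical, while the paradigm names follow the text.

```python
# (view of society, view of order) -> sociological paradigm
PARADIGMS = {
    ("subjective", "radical change"): "radical humanism",
    ("objective", "radical change"): "radical structuralism",
    ("subjective", "regulation"): "interpretatism",
    ("objective", "regulation"): "functionalism",
}


def paradigm(view, order):
    """Return the paradigm for a position on the two key dimensions."""
    return PARADIGMS[(view, order)]


# Traditional data mining is placed in the objective, regulative sector.
sector = paradigm("objective", "regulation")
```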

When the same approach is applied to orga- 
nizing conceptual domain area contexts related 
to this research, i.e. business network research 
(C) and organizational information technology 
(D), the important theoretical relationships of 
the context areas can be shown. It is possible to 
carry out a sociological analysis of these domain 
area-specific conceptualizations, and this pos- 
sibility has useful methodological implications. 
The core conceptualizations in business-oriented 
network research (C) are usually centered on the 
structural analysis of the topology, organization, 
and functions of net-like structures, which in their 
most generic form consist of a number of interconnected
nodes. In respect to the surroundings (or
environments) of the constellations of member 
organizations and depending on the chosen unit of 
research analysis, a set of net-like structures can be 
identified: business ecosystems, industry domain 
areas, virtual enterprises, collaborative business 
networks (CBN), industrial business networks, 
supply chains, large multinational corporations, 
SMEs, and micro-enterprises. After aligning these 
to the paradigmatic framework, also the concepts 
in the domain area of organizational information 
technology (D) can be positioned accordingly. 
The preliminary domain area analysis conducted 
as part of this research suggests that this category 
consists of the following set of relevant concepts: 
the Internet, communities of social media, personal 
computing, traditional information systems, and 
data mining. 

In sum, the theoretical benefit of applying the paradigmatic framework to the mapping of
several contextual domain areas is that it enables us to show the conceptual
interconnectedness of the entities essential to this research. The paradigmatic nature of
the framework means that concepts, models, and theories in each segment are closely
interrelated and thus, in the strictest sense of the theory, even mutually exclusive. The
proximity of entities such as data mining, classical information systems (for example
databases), nomothetic methodology, positivism, and large hierarchical companies, and the
fact that they are all attached to the objective regulative (i.e. functionalistic)
paradigm, suggest that the concept of traditional organizational data mining reflects only
the objective view of society and the positivistic conception of data, as well as the
cumulative nature of information that is mostly stored in static data sources of
hierarchically regulated organizations. In order to support knowledge discovery in the
context of business network research and operations, the improvement of classical data
mining methods with some new algorithms is clearly not sufficient. Instead, a more
paradigmatic shift from a positivistic and objective view of reality to an anti-positivistic
and subjective view of KD in socially construed reality is needed both in research and in
business management. This applies particularly to organizations in modern commercial
business ecosystems and competitive electronic markets.


MULTILEVEL MODEL AND
RESEARCH FRAMEWORK
FOR NETWORK-WIDE
KNOWLEDGE DISCOVERY


The main objective of this chapter is to develop 
a theoretically grounded and practically feasible 
model for network-wide knowledge discovers 
in real-world business network research contexts 
The starting point for this is the differentiation of 
the logical structure of the KD-process into the
traditional data mining approach (the positivistic
and organizational view) and a progression from
socially construed reality to the evolution of
epistemic society (the anti-positivistic network
oriented view). This analysis is grounded on
the previously outlined theoretical background
of philosophy of science, the four sociological
paradigms and the model of the discipline of
information technology (DIT). Then a preliminary
conceptual analysis of the phenomenon under
study is conducted from the relevant research
area perspectives, and finally, based on these, the
research framework for network-wide knowledge
discovery is presented.

Figure 3. Layered progression of knowledge discovery: positivistic and anti-positivistic paradigms

[Figure: a table that sets the objective view of society (positivistic paradigm) against the subjective view of society (anti-positivistic paradigm) for each logical layer and transition. Layers: Layer 0, networks/communities; Layer 1, stakeholder/organization specific; Layer 2, storage and persistence; Layer 3, gathered observations and content corpus; Layer 4, sources and informants (entities and events). Transitions: 1 -> 0, "evolutionization"; 2 -> 1, functionalization (traditional data mining versus information intensive governance and network-wide "data mining"); 3 -> 2, aggregation; 4 -> 3, observations (socialization).]


Logical Structure of the 
Knowledge Discovery Process 


By utilizing the sociological framework for orga- 
nizing the relevant conceptual domain contexts, 
this section presents the nature and characteris- 
tics of a paradigmatic shift from traditional data 
mining to the requirements and possibilities of 
network-oriented knowledge discovery. First,
both the positivistic and the anti-positivistic
scenarios are presented in detail. The discussion
is concluded with a logical structure diagram of
the process-oriented progression of KD from the
fundamental level of real-world phenomena to the
level of communities and networks (Figure 3).


Positivistic View: Traditional Data 
Mining 


According to positivism, truth is reached through 
competent methods and reliable measurement, 
which allows us to test it in the objective world; 
knowledge is cumulative and allows us to make 
progress and develop. The nomothetic approach 
examines the orderliness of phenomena and their 
causal connections through statistical generaliza- 
tions. It typically involves the utilization of one 
theory per case and at least a moderate number 
of statistical observations. Quite often it entails 
the testing of theory-based hypotheses that are 
analyzed statistically using accrued observational
material. Traditional data mining is carried out
according to the principles of nomothetic meth- 
odology. It makes use of statistical calculations to 
establish dependencies and correlations between 
variables and to find out the basic distribution of 
variables. 

One typical case of data mining can be found 
in the area of tourism, where a travel agency sys- 
tematically gathers customer satisfaction feedback 
after each trip from all the travelers. The feedback 
survey is implemented on the Internet using a 
typical questionnaire form. Most of the gathered 
data is classified and encoded, for example, the 
information related to customer satisfaction and 
the customers’ backgrounds. In this case, the data
source consists of these numeric representations,
which can be stored manually in a database, and
reports can be generated from them by using
traditional data mining (KDD) techniques. This
body of information is quantitative.
Among the concepts of philosophy of science,
the two dominant pairs in this case are the ones
related to epistemology and methodology; thus,
the information in this example is epistemologically
positivistic and methodologically nomothetic.
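
As a minimal sketch of this nomothetic style of analysis, the following Python snippet computes a basic distribution and a correlation for encoded questionnaire answers. The records, the variable names and the 1-5 coding are invented for illustration and are not taken from the case described above.

```python
from collections import Counter
from math import sqrt

# Hypothetical encoded survey records: (age_group, satisfaction), both coded 1-5.
responses = [(1, 2), (2, 3), (3, 3), (4, 5), (5, 4), (3, 4)]

def distribution(values):
    """Return the relative frequency of each encoded value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in sorted(counts.items())}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

ages = [age for age, _ in responses]
scores = [score for _, score in responses]
satisfaction_distribution = distribution(scores)
age_satisfaction_correlation = pearson(ages, scores)
print(satisfaction_distribution)
print(round(age_satisfaction_correlation, 3))
```

The distribution and the correlation coefficient correspond to the two kinds of statistical calculation mentioned above; a real study would of course need far more observations than this toy sample.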


Possibilities of the Anti- 
Positivistic View: A Case for 
Network-Wide Data Mining 


According to the interpretive scientific view, all 
knowledge is relative to the person who possesses 
it, and it can only be understood personally through 
the one who has been part of it. Truth is constructed 
socially through the objective knowledge of sev- 
eral interpreters, and therefore it varies with time. 
The epistemic community, then, is based on these
theories on the construction of social reality and 
on social facts (Searle, 1995). 

The previously presented example of a tourism
industry customer satisfaction form can also have
text fields for customer opinions, comments, and
other informal information. For example, customer
feedback could have been gathered under
the questionnaire heading “Other issues related
to the trip”, where one customer might have
answered: “birds’ singing in the morning brought
tears in my eyes”, while another response could
be: “fresh air gave me a good night’s sleep”. This
kind of content represents anti-positivistic and
ideographic information; in terms of scientific
classification it represents subjectivity and a
symbolic-interpretive view.

One of the salient questions of this work is
whether some form of knowledge discovery can
be applied to this kind of ideographic and
anti-positivistic information. The positive response
from the writers of this chapter is based on two
epistemological findings: new types of
methodological solutions and epistemological
objectivity realized through epistemic communities.
The methodological solution can be described as an
extended content analysis of qualitative material.
The phases to be implemented in research are: (1)
the recognition and definition of key concepts, (2)
finding the key concepts from a text and abstracting
(interpreting) these text passages, and (3) the
aggregation of the abstracted concepts.
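
The three phases can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the concept names, the indicator words and the naive substring matching are hypothetical, and the interpretive abstraction that the chapter has in mind would in practice be done by researchers or by text analysis tools rather than by keyword lookup.

```python
from collections import Counter

# Phase 1: recognise and define key concepts (hypothetical mini-codebook).
KEY_CONCEPTS = {
    "nature experience": ["bird", "air", "forest"],
    "rest and recovery": ["sleep", "relax"],
}

def abstract_passage(passage):
    """Phase 2: find key-concept indicators in a passage and abstract them."""
    text = passage.lower()
    return {concept for concept, indicators in KEY_CONCEPTS.items()
            if any(indicator in text for indicator in indicators)}

def aggregate(passages):
    """Phase 3: aggregate the abstracted concepts over all passages."""
    totals = Counter()
    for passage in passages:
        totals.update(abstract_passage(passage))
    return totals

feedback = [
    "birds' singing in the morning brought tears in my eyes",
    "fresh air gave me a good night's sleep",
]
concept_counts = aggregate(feedback)
print(concept_counts)
```

The aggregated counts are the point at which free-text answers become comparable across respondents, which is what makes a form of knowledge discovery possible for this kind of material.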


Layered Comparison of 
the Positivistic and Anti- 
Positivistic Views 


As stated above, traditional data mining exhibits
the positivistic and objective world view. The novel
approach to network-wide knowledge discovery
proposed here is based on the anti-positivistic view
of socially construed reality. In order to show in
detail the differences between these paradigms, a
generalized, layered logical structure of the
KD-process (including the transitions between the
layers) has been created (Figure 3). This model
consists of: (i) five logical representation layers,
(ii) four transitions between them producing
higher-level representations, and (iii) the
corresponding positivistic and anti-positivistic
substance matter of the knowledge discovery process.

The starting point of the KD (from bottom-up)
is the level of sources and informants (Layer 4) that 
consists of the entities and events of the phenom- 
ena under investigation. By making observations 
and by the process of socialization (Transition 4 
-> 3) a higher-level content corpus representation 
(Layer 3) is achieved. The aggregation (Transi- 
tion 3 -> 2) of gathered observations enables the 
persistent storage (Layer 2) of newly generated 
information. In order to reach the stakeholder- or 
organization-specific representation level (Layer 
1), the persistence-layer substance is functional- 
ized (Transition 2 -> 1). Here, it is important to 
note that within the positivistic world view this 
step manifests itself as the traditional data min- 
ing process and in the anti-positivistic case it is 
achieved by following the knowledge discovery 
methods proposed in this chapter; these methods 
are based on information-intensive governance 
and communicative information flow. Finally, the 
network operations and representation level (Layer 
0) can be seen as the result of a semi-autonomous, 
self-organizing emergence (Transition 1 -> 0) 
whereby the higher-level organizational structures 
are formed and through which they evolve. 
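
The bottom-up progression described above can be written down as a small data structure. The following Python sketch is an illustration only; the dictionary layout and the function name are assumptions, while the layer and transition labels follow the wording of the text.

```python
# Layers (4 = bottom, 0 = top) and transitions as named in the text.
LAYERS = {
    4: "sources and informants",
    3: "gathered observations and content corpus",
    2: "storage and persistence",
    1: "stakeholder- or organization-specific representation",
    0: "network operations and representation",
}
TRANSITIONS = {
    (4, 3): "observation and socialization",
    (3, 2): "aggregation",
    (2, 1): "functionalization",
    (1, 0): "self-organizing emergence",
}

def bottom_up_path(start=4, end=0):
    """List the layers and transitions passed when moving from start up to end."""
    steps = [f"Layer {start}: {LAYERS[start]}"]
    for lower in range(start, end, -1):
        upper = lower - 1
        steps.append(f"Transition {lower} -> {upper}: {TRANSITIONS[(lower, upper)]}")
        steps.append(f"Layer {upper}: {LAYERS[upper]}")
    return steps

kd_path = bottom_up_path()
for step in kd_path:
    print(step)
```

In this representation the positivistic and anti-positivistic views would differ only in what is attached to each layer and transition, not in the sequence itself.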


MULTIDISCIPLINARY CONCEPT 
ANALYSIS OF THE DOMAIN AREA 


The research here focuses on inter-organizational 
knowledge discovery and the related real-world 
phenomena that are intrinsically multidisciplinary.
There are some scientific views and research 
approaches in literature that are applicable to 
research settings where a common set of issues 
and problems are explored from varying but over- 
lapping perspectives. The scientific challenges of 
developing, creating, and utilizing novel business 
models and practices based on network-wide 
knowledge and enabled by information technology 
will be confronted by the theoretical backgrounds
of philosophy of science and DIT theory presented
in the previous section. In this kind of research
mutually shared conceptualizations are required
on which the higher-level science artifacts (i.e.
constructs, models, methods, and implementations)
can be based (Aaltonen et al., 2006).
Methodologically the work presented here 
has been founded on the constructive research 
approach (CRA) which is a form of action re- 
search that uses case-study methods to solve 
predetermined real-world problems and to 
generate new scientific knowledge of the area 
under study (Aaltonen et al., 2006; Kasanen et
al., 1993). In the sub-sections below, the results
of the multidisciplinary domain analysis supporting
the objectives of this work are outlined.
The preliminary research conceptualizations are
presented in the form of key terminology in the
following research areas: (i) business networks, (ii)
organizational IT, and (iii) contract law. Finally, 
the content of the shared conceptual domain area 
of all the individual research fields is specified. 


Networks and Business 
Environments 


According to the most generic specification, a 
network is a net that consists of point-like nodes 
and connections between them. In alignment
with graph theory, and depending on how
the nodes and links are characterized, a set of
layered topologies that exhibit various connectivity
structures and patterns (for example social
communities and business environments) can be
represented and analyzed. Below, an overview
is given of nets embedded in various business
environments, which are a specific case of net-like
structures (see Figure 4; Aaltonen et al., 2007b;
Choi & Stahl, 1997; Håkansson & Johanson, 1992;
OASIS, 2006), followed by a brief discussion on
the governance and information technology
adoption in business networks within the travel
industry.


Figure 4. Mapping the elements of various net-like structures (Aaltonen et al., 2007b)

Net-like structures | Economic markets (Choi & Stahl, 1997) | Industrial networks (Håkansson & Johanson, 1992) | Service paradigm (OASIS, 2006)
environment | market | network | context
nodes | participants | actors | services (agents)
resources | products and services | resources | capabilities
dynamics | transactions, processes | activities | interaction
connectivity | relationships, economic exchange, trust and power relations, information flows | actor bonds, resource ties, activity links | composition, choreography, orchestration
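
The mapping in Figure 4 can also be read as a lookup table for aligning terminology across the three kinds of nets. The following Python sketch is a hedged illustration: the dictionary layout and the `align` helper are assumptions made here, and the connectivity row is left out for brevity.

```python
# Element names per net-like structure, transcribed from Figure 4.
ELEMENT_MAP = {
    "environment": {"economic markets": "market",
                    "industrial networks": "network",
                    "service paradigm": "context"},
    "nodes": {"economic markets": "participants",
              "industrial networks": "actors",
              "service paradigm": "services (agents)"},
    "resources": {"economic markets": "products and services",
                  "industrial networks": "resources",
                  "service paradigm": "capabilities"},
    "dynamics": {"economic markets": "transactions, processes",
                 "industrial networks": "activities",
                 "service paradigm": "interaction"},
}

def align(term, source, target):
    """Return (generic element, counterpart term) for a term of the source structure."""
    for element, terms in ELEMENT_MAP.items():
        if terms[source] == term:
            return element, terms[target]
    raise KeyError(f"{term!r} is not an element of {source!r}")

element, counterpart = align("actors", "industrial networks", "service paradigm")
print(element, counterpart)
```

Looking terms up through the generic element is one simple way to compare nets whose context and complexity differ.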


Net-Like Structures 


It is useful to employ the net-like structure in the 
identification and specification of the common 
features of nets. A net is a net-like structure if it 
consists of: (i) nodes (or actors), (ii) connectiv- 
ity, meaning the links or relationships, (iii) the 
environment, which is the logical context or scope 
of the net in question, (iv) resources (or assets) 
that can be consumed or transferred between the 
nodes, and (v) the dynamics, which represents the 
various interactions and communications between 
the actors of a net. Even if the context or the level 
of organizational and structural complexity varies
considerably between nets, it is still possible to
align them by, for example, comparing the common
constitutional elements in each case.


BUSINESS NETWORKS


Networks are a form of governance between the
arm's-length relationships of markets and the
highly integrated organizations of hierarchies.
The specialization of nodes and low transaction
costs between nodes are the key characteristics of
networks. Transactions and relationships are usually
coordinated by shared information systems.

Business networks consist of actors, resources,
activities and their interconnections. Actors perform
interconnected transformation and transfer
activities that demand resources. Through networks
the actors may gain access to resources controlled
by other actors (Håkansson & Johanson, 1992).
Activities are connected with flows of information,
materials, finance and influence, and ultimately
they create value for the customers (Parolini,
1999). Business networks that exist mainly for
collaboration are called collaborative business
networks (CBN), and they aim at bringing together
the knowledge, expertise and other resources of
the actors (Kotler et al., 1993).


GOVERNANCE OF
BUSINESS NETWORKS


The governance of business networks is a compromise
between control and emergence: control
impairs innovation and flexibility, and emergence
erodes routines and predictability (Choi et
al., 2001). Networks are managed and coordinated
in terms of knowledge, communication, decision 
making, price, authority and social relationships 
(Kohtamäki, 2005; Zettinig, 2003). The funda- 
mental network management functions are: (1) 
framing: forming a vision of value creation and 
communicating it, (2) activating: realizing the 
structure and pattern of actors, resources and 
activities to create value, (3) mobilizing: building 
commitment among actors toward mutual value 
creation, and (4) synthesizing: monitoring and 
measuring value creation and facilitating interac- 
tion (Järvensivu & Möller 2008). 


DOMAIN SPECIFIC BUSINESS 
NETWORKS: TOURISM INDUSTRY 


Tourism is a network industry par excellence due 
to its fragmented nature, actor interdependence, 
collective resources and production (Scott et al., 
2008). A traditional tourism value chain consists 
of four stages: suppliers of basic tourism services, 
tour operators (wholesale), travel agents (retail) 
and consumers (tourists). A central problem therein 
is the matching of demand and supply (the issue 
of supply chain management, SCM), which em- 
phasizes the role of intermediaries. 

A tourism network can be defined as a set 
of formal, co-operative relationships between 
organizations that is formed to achieve a par- 
ticular purpose in the tourism business. Tourism 
production networks consist of producers and 
users of different services and are coordinated 
by interaction between actors. These networks
rely on the creation, gathering, communication
and application of operative information, and on
the learning and exchange of knowledge that guides
day-to-day activities.


Organizational Information 
Technology 


In this section the main conceptual entities rel- 
evant to this work that belong to the intersection 
of information technology and knowledge inten- 
sive organizational management and operations 
are presented. An overview of the hierarchy
of the internal conceptual structure of information
is given, followed by a short description of
traditional data mining.


DATA, INFORMATION, 
AND KNOWLEDGE 


The core substance of information technology 
in general and information system sciences in 
particular is the representation and processing of 
information. Also in modern organizations the 
utilization and management of various forms of 
business information are among the critical success
factors in competitive IT-enabled trading
environments. However, the conception or the 
understanding of information varies considerably 
when viewed from the disparate perspectives of 
business and technology. When applying the re- 
sults of this research in the contexts of business 
and science, it is therefore essential to explicate 
the meaning of information in a mutually agreed 
way, and optimally set it in a philosophically 
reasoned and logically sound relationship with 
data and knowledge. In short, knowledge
is an applicable or usable collection of information,
which in turn consists of processed or interpreted
facts, symbols or marks (i.e. data).

Also, the typical hierarchical relationship be- 
tween data, information, and knowledge is based 
on the inherent level of abstraction in each (data 
at the lowest level and knowledge at the highest). 
Furthermore, it is based on the idea of containment
or the “prerequisite principle,” which simply
means that information cannot emerge before
some data is processed and/or interpreted and that
knowledge emerges only if there is information
available and if it is applied or used for some
particular purpose. In system-analytical thinking, this
hierarchical continuum is sometimes extended to
entail the notions of understanding and wisdom,
in which case the presupposition is that all five
categories exist in and represent the content of
the human mind (Ackoff, 1989).
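
The prerequisite principle can be illustrated with a small Python sketch in which each level is derived only from the level below it. The scores, the function names and the threshold are hypothetical and serve only to show the ordering data, information, knowledge.

```python
# Data: raw encoded marks, here hypothetical satisfaction scores (1-5) per trip.
data = {"trip A": [4, 5, 5], "trip B": [2, 3, 2]}

def to_information(raw):
    """Information emerges only when data is processed or interpreted."""
    return {trip: sum(scores) / len(scores) for trip, scores in raw.items()}

def to_knowledge(information, purpose_threshold=3.0):
    """Knowledge emerges when information is applied to a particular purpose,
    here deciding which trips need improvement (the threshold is an assumption)."""
    return [trip for trip, mean in information.items() if mean < purpose_threshold]

information = to_information(data)
trips_to_improve = to_knowledge(information)
print(information, trips_to_improve)
```

Neither function can be called before its input exists, which mirrors the claim that information presupposes data and knowledge presupposes information.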


DATA MINING 


Traditional data mining (DM) typically occurs 
in the organizational context where existing 
structured and static data sources (i.e. databases) 
function as the source of the actual information 
and knowledge extraction process. Based on this, 
data mining is sometimes referred to as knowl- 
edge discovery in databases (KDD), but there 
are several other definitions for DM in literature: 


• the step in the process of knowledge discovery
  in databases that inputs predominantly
  cleaned and transformed data,
  searches the data using algorithms, and
  outputs patterns and relationships to the
  interpretation/evaluation step of the whole
  knowledge discovery process in databases
  (Fayyad et al., 1996)

• the science of extracting useful information
  from large datasets (Hand et al., 2001)

• the process of the exploration and analysis,
  by automatic and semi-automatic means,
  of large quantities of data in order to discover
  meaningful patterns and rules (Berry
  & Linoff, 1997)


As can be inferred from these descriptions, the
requirements of networked business governance
and the characteristics and demands of the
inter-organizational operating environments of target
organizations are mostly outside the scope of
traditional DM. Also, within the field of data
mining itself, it is widely accepted that traditional
techniques and methods may be inadequate when
vast amounts of multi-dimensional data are
distributed among heterogeneous data sources and
shared by various stakeholders.
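
To make the first definition above more concrete, the following Python sketch walks through a cleaning step, a mining step and an evaluation step on a single static data source. The booking records, the function names and the support threshold are invented for illustration; the mining step is a plain co-occurrence count, not a technique proposed in this chapter.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw booking records; None marks a missing value.
raw_records = [
    ["flight", "hotel", "car"],
    ["flight", "hotel"],
    ["hotel", None, "car"],
    ["flight", "hotel", "tour"],
]

def clean(records):
    """Preprocessing: drop missing values and duplicate items."""
    return [sorted({item for item in record if item is not None}) for record in records]

def mine_pairs(records):
    """Data mining step: count how often each pair of items occurs together."""
    pair_counts = Counter()
    for record in records:
        pair_counts.update(combinations(record, 2))
    return pair_counts

def evaluate(pair_counts, n_records, min_support=0.5):
    """Interpretation/evaluation: keep pairs whose support reaches min_support."""
    return {pair: count / n_records for pair, count in pair_counts.items()
            if count / n_records >= min_support}

cleaned = clean(raw_records)
patterns = evaluate(mine_pairs(cleaned), len(cleaned))
print(patterns)
```

The sketch assumes that all records sit in one place and share one schema, which is exactly the assumption that breaks down when data is distributed among heterogeneous sources and stakeholders.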


Contract Law and 
Functional Principles 


Contract law is defined as a branch of law where 
the main focus is on a contracting party’s free 
will to commit itself to a binding obligation. A 
contractual obligation may be fulfilled instantly 
or during a longer period of time. In the latter case a
contract may be defined as a long-term contract,
which is actually a contract governing continuous,
long-lasting, and cooperative contractual
relationships (Nysten-Haarala 1998). In long-lasting
relationships contracts are often used for
information asymmetry reduction.


CONTRACTUAL PRINCIPLES 


Contractual principles derive directly from contract
law and are sometimes defined as a cohesive
element of contract law (Pöyhönen, 1988). In
order to reduce information asymmetry, contractual
relationships build up an internal governing
method; the principles work together to govern a
long-term relationship. In this chapter, although
contract law is not applicable to all kinds of network
structures, some of its principles are chosen for
describing the governance of networks, enabling
knowledge sharing, and reducing information
asymmetry. Here, these principles are called
functional principles, the term functional coming
straight from the governing and activating nature
of these principles.


FUNCTIONAL PRINCIPLES


Functional principles constitute the most operative
frame of network governance. Here, the functional
principles are defined as fairness/equality,
good faith/fair dealing and trust/confidentiality.
Fairness/equality concentrates on balancing a
relationship and by these means reducing information
asymmetry. Good faith/fair dealing, for its
part, makes relationships and networks operative
and functional together, and trust/confidentiality
consolidates and improves the activity itself.


Shared Conceptual Domain Area 


In a multidisciplinary research setting it is typical 
that some of the key concepts are shared by several 
individual research areas. Here, the justification 
for the following preliminary discussion about 
phenomena such as information asymmetry, 
information intensive business governance, and 
network-wide knowledge management is that
they form the central terminology for the ultimate 
conceptualization and methodological specifica- 
tion of the proposed network-wide knowledge 
discovery research. 


INFORMATION ASYMMETRY 


In relation to net-like structures and economic
markets, information asymmetry refers to a condition
in which at least some relevant information
is known only to some of the parties involved. One
of the gravest effects of information asymmetry is
that it causes markets to become inefficient, since
not all market participants have access to the
information they need for their decision making
processes. In this sense, the study of information
asymmetry deals with decisions in transactions
where one party has more or better information
than the other. This creates an imbalance of power
which can sometimes cause the transactions to
become distorted. On the other hand, information
asymmetry may easily be considered the
prototype of modern exchange, which is carried
out in an asymmetric informational environment
(Lauriala 2001).

In some exchange trade environments and
economic markets information asymmetry may
sometimes be desired or even a necessity for
creating innovation and growth (Lamberton 1998).
However, asymmetric information changes the overall
operational patterns of cooperative relationships in
such a way that it becomes impossible to predict the
acts of the other party (Virtanen 2001), especially
when the parties act only for their own good. It
seems quite clear that this does not work for the
good of exchange. This implies that increasing
mutual communication and thereby decreasing
information asymmetry seems to work for the
best of society (Turunen 2005).
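
The effect of mutual communication on information asymmetry can be illustrated with a toy calculation in Python. The index used here (the share of relevant information items not known to every party) is a hypothetical measure introduced only for this sketch, as are the parties and the information items.

```python
def asymmetry(known_by_party, relevant):
    """Share of relevant information items that are not known to every party."""
    shared = set.intersection(*known_by_party.values()) & relevant
    return 1 - len(shared) / len(relevant)

relevant_items = {"price", "availability", "quality", "demand forecast"}
knowledge = {
    "tour operator": {"price", "availability", "demand forecast"},
    "travel agent": {"price", "quality"},
}

before = asymmetry(knowledge, relevant_items)

# Mutual communication: each party discloses what it knows to the other.
disclosed = set.union(*knowledge.values())
after = asymmetry({party: disclosed for party in knowledge}, relevant_items)
print(before, after)
```

In this simplified setting full disclosure removes the asymmetry completely; in practice parties disclose selectively, which is where common rules for circulating information become relevant.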


INFORMATION INTENSIVE 
BUSINESS GOVERNANCE 


Referring to the above discussion about the
importance of explicating the understanding of the
key concepts and their relationships, especially
in multidisciplinary research settings, the idea
of the proposed information intensive business
governance (IIBG) is discussed here in order to
differentiate it from the more traditional business
management conventions.

Business governance has traditionally been 
divided into the following separate functional 
management areas: human resources, operations 
or production, strategic management, marketing, 
finance and lately also information technology. 
In modern enterprises operating in complex
IT-enabled competitive and collaborative network
environments, organizational governance is more
properly defined in terms of the identified critical
business activities (processes) or objects (assets)
that are subject to management. The predominant
management practice is the process-based view
(PBV) of business, where the strategic business
objectives and visions are to be realized through
detailed modeling and engineering, and by the
cost-effective performance of the core business
processes. In contrast to this, the resource-based
view (RBV) of a firm (Barney, 1991) focuses
management on the allocation, utilization, and
optimal alignment of the demand and supply of
critical business resources or assets. While there are
many business resource categorization schemes
(Aaltonen et al., 2007b), in the resource-based
strategic management practice the categories used
are: financial assets and tangible, intangible and
human resources. In this scheme, the intangible
resources are of importance because they are
information-based assets including competencies and
reputation (Lowendahl, 1997). Thus, information
intensive business governance is an organizational
management approach that emphasizes knowledge
and information intensive assets as being the core
competencies and enablers of successful business.


NETWORK-WIDE KNOWLEDGE 
MANAGEMENT 


As has been stated above, the governance of 
networks is a complex and controversial issue. 
However, there seems to be a general agreement 
that it should be built on trust and norms and that 
decision rights should be allocated in relation to 
expertise. But when the dimension of knowledge 
is added to the subject matter of network-wide 
governance, many business- and research-oriented 
issues emerge. In practice, the inter-organizational 
knowledge management here means the utilization 
of the results of network-wide data mining that 
should be made available to business stakeholders 
in some form of a shared knowledge repository. 
Solving the problems of designing, developing,
operating, and managing such a network-wide
knowledge base in mutual understanding could
prove to be very challenging. For example, in
supply chain type networks the existing
level of knowledge integration is naturally low
because the relationships are instrumental and
the communication is directly tied to product
issues.

Also, knowledge is often partitioned in relation
to expertise and decision rights, but at the same
time there is a need for knowledge redundancy
in network nodes (Konsynski & Tiwana, 2004).
Widely adopted ICT supports innovative ways
to create cooperative alliances. The exchange
of information and the ability to interact are key
elements in executing business processes in a
business network (or a business web). To
succeed in network-wide process- and system-level
interoperability, which is a necessity in
business webs (Hakolahti & Kokkonen, 2006),
inter-organizational agreements are needed to
enable the management of supporting collaborative
business operations and resource-related
knowledge bases.


Network-Wide Knowledge 
Discovery Research Framework 


The main aim here is to advance the model-oriented
work conducted so far to a preliminary
research framework that provides network-wide
support for identifying and representing the
existing organizational information contents and
inter-organizational information flows. This is
done within the field of multidisciplinary business
network research from the perspective of
data mining in order to enable: (i) network-wide
knowledge discovery and management, (ii)
information intensive business governance, and
(iii) the reduction of information asymmetry.
Below, an overview of the preliminary research
framework for network-wide business knowledge
discovery and management is given (Figure 5),
where the contents (i.e. the models, constructs, and
methods) have been divided into five scientific
substance levels (Kamaja, 2009) (from bottom
up): (i) philosophy of science and metatheory, (ii)
general theories of science, (iii) special theories
of science, (iv) the analytical level, and (v) the
empirical level. The theoretical justification of
organizing the elements of the model is based
on the DIT, the components of which are here
mapped to the relevant levels of the framework.

Figure 5. The multilevel research framework for network-wide business knowledge discovery

According to DIT, there are two metatheoretical
constructs at the level of philosophy of
science and metatheory: the socially construed
reality (Popper, 1972; Searle, 1995) and the four
sociological paradigms (Burrell & Morgan 1979).
The higher, general and special theories of science
can be built on these constructs. One way to
achieve this is to formulate a research program
by using DIT’s organization theory constructs,
i.e. Hatch’s multi-perspective view. In concrete
terms, during the prior and on-going DIT-compliant
business network projects at the University
of Lapland, the semantic aspects of
inter-organizational business communication (for
example, interoperability issues), and the challenges
in collaborative research have been addressed
partly by developing a multidisciplinary
concept evolution (MCE) framework (Aaltonen
et al., 2006; Aaltonen et al., 2007a) that supports
a multi-perspective domain area analysis and the
sharing of research conceptualizations. Also, a
preliminary service-oriented extension of the roles
linkage model for business networks has been
proposed (Aaltonen, 2007), preparing a way for
the service-centered structural analysis of industrial
networks. The model aims to make it easier
to align the roles that individual enterprises play
in business networks with the IT-oriented service
paradigm based descriptions of corresponding
linkages or business relationship generalizations.
In addition, generic business information
conceptualization models that rest on the semantic and
dynamic extension of the resource-based view of a
firm are useful for an organization in categorizing
and classifying knowledge based business resources
(Aaltonen et al., 2007b) as a prerequisite
for successful information intensive business
governance (IIBG). In relation to the characteristics
of inter-organizational information flow and
information asymmetry reduction, the functional
principles originating from contract law can be
capitalized on in ongoing and future research
programs. It is also plausible to apply these principles
in situations where organizations try to agree on
the terms of, for example, network-wide knowledge
management practices.

In conclusion, enabling inter-organizational
knowledge discovery requires that the
complex cross-disciplinary research phenomena
presented in this work and the scientific substance
of the framework above be theoretically
aligned and conceptually integrated to form a
seamless whole under a business network research
program.


NETWORK-WIDE KNOWLEDGE 
DISCOVERY AND COMMUNICATIVE 
INFORMATION 


In order to elaborate the above proposed multilevel
knowledge discovery research model and framework,
some practical requirements, enablers, and
implications are discussed in more detail in this
section, which is organized into sub-sections as
follows: (i) specification of the inter-organizational
information flows with respect to the anti-positivistic
view of society, (ii) discussion about information
asymmetry reduction through contract law-oriented
functional principles, and (iii) presentation of
the case study results of an existing tourism network
where the main information and knowledge
transfers between the key actors were identified.


Anti-Positivistic Characterization 
of Inter-Organizational 
Business Information 


Decreasing information asymmetry needs to be 
founded on open access to information or on 
increased exchange of information. Thus infor- 
mation itself and its communication need to be 
based on some common rules or basic principles 
of circulating information and getting access to it. 
Such information intensive governance of business
relationships must observe the main characteristics
of inter-organizational information flows. In
general, it could be possible for organizations to
seek to reduce the overall information asymmetry
of the networks they are members of, also
in their data mining initiatives, by identifying
and strengthening the enablers corresponding
to the relevant communicative information flow
characteristics: (i) dynamics, (ii) semantics, (iii)
relativity, (iv) usability, and (v) openness and
protectability (Figure 6).

These identified inter-organizational information
flow characteristics are a manifestation of
the anti-positivistic view of society presented
previously. In business network context, the
dynamic nature of information (i.e. information as
a relationship) (Czerniawska & Potter, 1998), for
example, is based on the view that inter-organizational
transactions reflect the transient patterns
of human behavior, and the reduction of information
asymmetry could in this case be achieved by
increasing the accessibility to and availability of
information. The networked organizations could
also benefit from recognizing and reacting to the
other information flow aspects in their business
relationships. In line with past and ongoing
multidisciplinary research done at the University of
Lapland, these notions and insights are elaborated
below. The main contributions here originate
from the domain areas of contract law, and
analysis of digitally enabled business relationships
and networks in the field of tourism.


Information Asymmetry 
Reduction by Contract Law- 
Based Functional Principles 


With respect to the principles of contract law and in
order to reduce information asymmetry, information
intensive governance can be based on the
following functional principles: fairness/equality
(reasonableness), good faith/fair dealing, and
trust/confidentiality. Choosing and using these
principles taken from contract law is justified
by the cooperative nature of networks and their
relationships. Largely the same basic principles that
make a network functional also operate in the
background of contract law. These three
principles are chosen because of their dynamic
nature, their strong emphasis on reciprocity and
mutuality, and their intrinsic relevance to the
proposed anti-positivistic network-wide KD-process.


Figure 6. Inter-organizational information flow characteristics and asymmetry reduction enablers

Characteristic | Description | Information asymmetry reduction (and network-…) enabler
dynamics | … dynamic nature (Czerniawska & Potter, 1998) | …
semantics | … conceived and semantically “annotated” by the participants. | Explicate the meaning -> requires the construction of common and shared conceptualizations, ontology engineering (Aaltonen et al., 2007a)
relativity | The stakeholders assess the information according to varying value functions (Shapiro & Varian, 1999) | Explicate the value functions -> requires semi-formal value creation models
usability | Communicating information makes it usable for a wider range … | Makes it possible to target the information more precisely -> customer intensive network analysis methodology
openness & legal protectability | The transparency (but also the vulnerability) of information increases in communicative actions. | Implement property rights management, privacy and information security guidelines and policies -> inter-organizational information security model (Aaltonen et al., 2008)


Fairness/Equality as 
Interpretation and Relativity 


Information is an activity and it is experienced
rather than possessed (Shapiro & Varian, 1999).
In the business network environment, information
is based on inter-organizational business transactions
where human behavior and communication
constitute the core. This view makes information
relative. As information seems to be based on
fluent communication, it seems plausible to build
instruments for reducing information asymmetry
on the functional nature of information.


On the other hand, communicative actions
increase the transparency of information and bring
some openness to information flows. Information
often forms the basis of network relationships,
which are generally collaborative and more or
less information-based. Openness is a content-related
element, and it is supported and increased
by pursuing fairness and equality in collaboration
(Saunders et al., 2004). At the same time, and as a
result of fairness, openness works to decrease
information asymmetry.

The concept of fairness is best known from
Nordic contract law, where the focus is usually
placed on a real mutual balance between parties.
Conversely, the very aim of fairness is to
decrease the imbalance and asymmetry caused
by the economic or informational inequality of
the parties. One of the prerequisites of fairness
is that the contracting parties openly disclose
unusual issues known to them (Atiyah, 1979). In
this way, every actor of the network gets the same
information. By these means, interpreting information
while making it relative serves to enable
fair and equal access to it. Thus information, at
its content level, becomes more accessible to all
stakeholders.

As a single instrument for information asymmetry
reduction, fairness/equality is relatively
weak, as it is mainly tied to content-related issues.
It may be improved and strengthened through
some more operative principles, like good faith/fair
dealing, which works for openness as well, but
in a more functional manner. As these principles
operate side by side, they are closely linked to
each other. This linkage emphasizes and ensures
that the collaborating parties observe each other’s
advantages as far as is reasonable.


Good Faith/Fair Dealing Constituting 
Information Usability 


Business networking is generally based on the
common goal setting of partners. The goal is
often based on information, and the purpose of
the network is to create added value or synergy
(Balloch & Taylor, 2001). On these grounds,
networking is defined as a functional entity, and
this entity requires coordinating the roles of the
stakeholders in order to enable knowledge sharing.
Good faith/fair dealing works to ensure a
common set of goals and to increase
the communication among the networking parties.
By these means good faith/fair dealing, for its part,
reduces information asymmetry.

As a general principle, good faith/fair dealing
is closely related to contract law, in this case
European contract law, which thereby constitutes a
mirror through which the content of the principle
may be examined. The Commission of European
Contract Law has laid down the principles of
European contract law (Commission of European
Contract Law, 1999), including good faith, fair
dealing, and cooperation. In this document it is
stated (Article 1:201) that each party must act in
accordance with good faith and fair dealing, an
obligation the parties are not allowed to exclude
or limit. In this way good faith/fair dealing supports
network activities and makes them more
transparent.

Functionality in network relationships is based
especially on the above principle and it enables
the communication of information, but it is
communication that makes information usable for a
larger group of stakeholders. For example, in a
long-term contract good faith/fair dealing requires
that the parties observe not only their own interests,
but also the interests of the other contracting party
(Annola, 2003). Often this is the most significant
issue when building up a long-term relationship
or when operating in one; cooperation may not
even be started without a mutual core of interest.

In general, good faith/fair dealing is bound
closely to fairness/equality, and both are founded
on enabling and ensuring communication.
Fairness/equality supports relative and interpretative
content for all parties, whereas good faith/fair
dealing principally concentrates on communication.
These principles even seem to suggest that
communication needs to be carried out in such
a way that all the stakeholders are treated on
the same basis. In this way they operate
together and support each other through a
functional linkage. This complex is functionalized
further by trust.


Trust/Confidentiality 
Increasing Accessibility 


Information asymmetry is often reduced by
granting access to information and increasing
its availability. Both fairness (by explicating the
meaning and value of information) and good faith/fair
dealing (by increasing its usability) improve
access to information. Granting access to information
is, again, based on communication, which
is best described as a part of information when
information is seen as an activity rather than a
possessable object (Shapiro & Varian, 1999).
Because of this, information is less like an inflexible
object to be implemented in unchangeable conditions.
This makes information operate like a relationship
that does not exist in isolation but is rather linked
to its more fundamental meaning. In terms of its
very essence, information is communicative and
thereby a dynamic asset.

The successfulness of the communication by which
access to these assets is provided is mainly
based on the capability of the parties to trust each
other; all the advantages reached by communication
are derived from the trust/confidentiality of
the relationship, which is seen as a continuous
process that includes obtaining and experiencing
new knowledge about the relationship, sharing
information with the networking partners, and
becoming habituated to trust-based acts.

As mutual trust between parties is required to
create a solid base for the commitments needed for
cooperation, reciprocity constitutes the
essence of trust (Atiyah, 1981) and it operates to
strengthen the relationships of the parties. Each
level of trust has a corresponding level of information
sharing, and in communicating information
trust is strengthened accordingly. The more partners
trust each other, the more they are expected
to share information (Saunders et al., 2004).


KNOWLEDGE DISCOVERY 
REQUIREMENTS IN IT-BASED 
TOURISM INDUSTRY NETWORKS 


As indicated previously in the domain area analysis
of the travel business networks, intermediaries
such as an incoming tour operator (ITO) are the
key actors in the tourism industry. Therefore, an
existing ITO was selected as the focal company
of the case study, in which a specific travel industry
network in the Lapland area is described in terms of
its main actor roles, activities and information flows
(Figure 7) and, as part of this research, is linked
to the anti-positivistic information characteristics
and related knowledge discovery requirements.
An ITO connects the networks of customers
and suppliers. Its main function is to match the
demand for and supply of tourism services and
packaged tours. These intermediaries operate in
some geographical destination and they compete
for customers and tourism-related resources provided
by different suppliers. The key resources
for a packaged tour are hospitality services
(accommodation and food), arranged activities for
travelers, and transportation services to take groups
of travelers from their accommodations to activities
and attractions. The customers of incoming
tour operators are often outgoing tour operators
residing in the departure areas of travelers.
They market and sell packaged tours to end customers
and are responsible for transporting them
to the destination area. These kinds of information
technology-oriented tourism business networks
can be evaluated by analyzing digital enablers
and service supply network activities, which
are of several types (Rai et al., 2005):
(i) service design, (ii) sourcing, (iii) logistics,
(iv) production, and (v) asset management.

Figure 7. Actors, activities and information flows in the tour operating network

The essential analytical question here is how
the main service supply network activities
within the travel industry (i.e. service design and
realization, and asset management) are supported
by inter-organizational knowledge discovery with
respect to the relevant generic anti-positivistic
information flow characteristics. To answer this, a
set of inter-organizational information flow content
types that relate to each group of activities in the
network under study can be identified: (i) service
specifications, (ii) demand and sales information,
(iii) consumer information, (iv) service realization,
and (v) feedback information. For these, it
is possible to define the preliminary KD requirements
that take into consideration the generic
information flow characteristics. For example,
during the service design activities the business
innovations are either transformed to concrete
service specifications, or existing services are
adjusted in response to changing environments.
These activities would benefit from information
about customer and partner feedback on new
products or from support for the identification of
important service components. Also, in a more
production oriented view, the overall network-wide
performance (service realization activities)
could be enhanced if knowledge about the business
rules for service execution optimization and
recovery tactics were available. For organizations
to succeed in capitalizing on these types
of network-wide knowledge discovery or asset
management services they would additionally
have to respond to a number of specific issues and
requirements related to the generic information
flow characteristics:


• dynamics - How is the service consumed
and what resources are tied to it?

• semantics - Requires that commonly understood
and compatible service specifications
are shared by the consumers, intermediaries
and suppliers.

• relativity - Different value in service
design for consumers and intermediaries
should be reflected in the service
specifications.

• usability - Taking into consideration the
different information needs of consumers
and intermediaries, and the ability to combine
and compare service specifications,
increases their usability.

• openness - Openness and distribution of
service information should be based on
information needs, and regulated and
sanctioned by, for example, immaterial
rights, privacy, security policies based on
network-wide information security models
(Aaltonen et al., 2008), and non-disclosure
agreements.

To provide yet another perspective on this
discussion, the main features and responsibilities
of feasible KD services for the travel industry
business networks can be organized in relation to
the above mentioned domain specific information
flow content types:


• service specification - The essential information
contents and service components
for consumers and intermediaries.

• demand and sales information - Demand
patterns for different services.

• consumer information - Combined service
consumption patterns and service
component significance for the whole tourism
experience.

• service realization - The critical service
flow incidents.

• feedback information - Differences of
tourism experience among consumer
groups.


In sum, knowledge discovery in the busi- 
ness network environment should support value 
creation in terms of efficiency, innovation, cus- 
tomer lock-in, and resource availability. These 
activities should also enable the digitalization of 
supply network operations in terms of visibility, 
process integration, measurement, and informa- 
tion sharing. In the latter two cases, the benefits 
of network-wide KD would be especially high. 


FUTURE TRENDS 


The main contribution of this chapter is the
theoretical analysis of the proposed paradigmatic
change to an anti-positivistic research approach
in order to support business network-wide knowledge
discovery. In this section, the focus is more
on the practical future considerations of intra- and
inter-organizational data mining. In what follows,
the phenomena under study in this work are first
reviewed by categorizing the main insights,
comments and prospective business benefits
and outcomes into a representative set of trends to
which theoretical and practical research should
respond. The result is a summary and the
identification of the key benefits, overall feasibility
and the main issues of developing, using and
adopting the proposed network-wide knowledge
discovery approach in research and in operations
at the stakeholder level.


Feasibility of Network- 
Wide Data Mining Approach 
at Stakeholder Level 


Many issues remain unresolved at the business
network participant level in relation to information
technology adoption in general, and this has a severe
impact on the feasibility of organizational data
mining. In tourism networks, as in many other
similar networked industries with complex
interconnected value chain topologies, information
systems, applications and databases enable the
network-wide creation, sharing and transfer of
knowledge. But, in many cases, the coordination
of business activities and the sharing of knowledge
are hindered by a lack of shared information
systems, a lack of legacy/proprietary systems
integration, and an unwillingness to implement
or use these systems. Also, the practical business
benefits of such systems are somewhat uncharted,
but they seem to involve a reduction of manual
information processing, higher quality of service
execution, and improved business planning and
decision making. Additionally, important tourism
systems (dynamic packaging and journey
management) are not being implemented due to
inadequate information infrastructures and the
high cost of systems integration (Hakolahti &
Kokkonen, 2006). However, actors in the tourism
business recognize the need for a cooperative
approach to protect against vertical integration led
by large international tourism operators. Yet they
also recognize the existing culture of independence
and competition (Saloheimo et al., 2007). Also,
advances in IT have made customers (i.e. tourists)
more experienced and demanding (Guzman et
al., 2008). Based on this it follows:


• (future trend) increasing overall organizational
IT adoption has a direct positive impact
on conventional data mining feasibility
and will indirectly lower the barrier for
novel network-wide KD approaches.

• (research enabler) appropriate, cross-disciplinary,
and participatory research methodologies
(e.g. action research) should be
designed and applied, in which the stakeholder
requirements and expectations must
carefully be collected and observed.


Knowledge-Centered Business 
Network Management 


The challenges in the field of inter-organizational
business governance stem basically from the fact
that the typical operational connectivity properties
between networked organizations do not easily
support and allow for mutual strategic decision
making, which has a negative impact on the
collaborative management of knowledge assets
and corresponding information flows. Also, the
existing asymmetric trust and power relationships
and the competitive nature of modern commercial
ecosystems usually prevent the sharing and
communication of critical business knowledge.
Yet, the existence of some form of knowledge-centered
network governance is crucial for the
emergence and optimal utilization of the results
of the knowledge discovery method proposed here.
Additionally, there clearly is a need for a neutral
and trustworthy network actor who specializes
in data, information and knowledge gathering,
mining and dissemination for the whole network.
Based on the findings in this work and especially
on the discussion about contractual principles and
information asymmetry reduction, it is possible
to suggest that:

• (future trend) the assimilation of contract
law based functional principles and opting
for long-term cooperative business agreements
over transaction cost reduction
oriented short-term contracts encourage
existing business networks to turn into net-like
structures that exhibit and support the
characteristics and features of collaborative
knowledge based business networks in
which the products of network-wide data
mining (i.e. business knowledge repositories)
can more easily be capitalized by the
participants.

• (research enabler) in research projects that
seek to analyze and resolve the issues
of network governance, it seems evident
that these phenomena should be triangulated
from multifaceted, cross-disciplinary
research angles; particularly combining
business network theories with IT-enabled
knowledge representation and contract
law-oriented functional principles opens
promising research prospects.


Information-Oriented Network 
Analysis and Modeling 


Theoretically, this work has been engrossed in the
philosophical and meta-theoretical foundations
of a required paradigm change from the positivistic
view of reality to the anti-positivistic world view.
The essence of the proposed novel approaches,
models and methodologies is in the conception
of information. It is important to realize that
overlooking the opted information and knowledge
associated research orientations may hinder
research that is directed to information-centered
real-world phenomena. Most notably this is evident
when the scientific foundations and views are
not chosen appropriately but instead the research
unconsciously remains constricted, for example, to
a prevailing and traditional paradigmatic approach.
The negative implications of this can be noticed
by comparing the fundamentally different conceptions
of gathered observations (Layer 3 in Figure
3) under the anti-positivistic world view (social
facts) and the positivistic world view (static data).
On the positive side of things, a conscious and
explicit commitment to developing and designing
anti-positivistic research methods will widen the
prospects of future scientific work by promoting
philosophically and semantically grounded
network structure models and behavior models.
The information related network research trends
and enablers can thus be summarized:


• (future trend) the ongoing and escalating
sociological and technological paradigmatic
change from the positivistic view of reality
to the anti-positivistic view of society has
a major theoretical and practical impact on
scientific work and information intensive
business network research in particular.

• (research enabler) as information and
its significance to the operation of a business
network organization is at the core of
DIT theory, the future themes in research
should include topics such as information
handling, information processes, the role
of man in information processes, information
quality and detailed analysis of information
flow characteristics.


CONCLUSION 


The main proposition of this research is that the
traditional conception and implementation of
organizational data mining fails to address the
important needs and requirements of knowledge
discovery in modern business organizations operating
in highly heterogeneous, inter-connected
and networked service provision environments.
The philosophically justified sociological paradigm
change influences the theoretical and
practical research work that focuses on information
intensive collaborative environments and on
knowledge elicitation, representation and
discovery initiatives. The multidisciplinary nature
of the phenomenon in question draws attention to
research area conceptualizations and the sharing
of them. All of these theoretical aspects have been
discussed in this work, which ultimately proposes an
anti-positivistic multilevel research model and
framework to address these issues in ongoing and
future academic work.

One of the specific issues here is the observed
strong information asymmetry between many
small and medium size business organizations
(especially in the travel industry). To resolve this,
novel conceptual and practical approaches and
methods can be developed during ongoing and
future research by focusing on the characteristics
of inter-organizational information flows and on
the requirements of network-wide data mining
approaches. In this regard, the aim is to fully enable
organizations to utilize content that manifests itself
both in the existing internal and external static data
sources, and in dynamic and relative knowledge
based and information-intensive communications
and transactions.

To summarize, the main observation is
that in order to enable the provision of research
oriented solutions to the phenomena of KD in
business networks, the existing approaches and
orientations should be extended and elaborated by
concentrating on the information and knowledge
modeling dimension of business network operations
and relationships. This means that successful
theoretical and practical results in future
network-wide knowledge discovery research can
be expected when the focus is directed to issues
in the overall network-wide information intensive
business governance, especially when the
examination is designed in alignment
with the meta-theoretical models of the discipline of
information technology (DIT), the functional principles
of contract law, and the anti-positivistic,
ideographic and subjective view of social reality
construed from social facts.


KEY TERMS AND DEFINITIONS


Network-Wide Knowledge Discovery: In
contrast to intra-organizational and database-centered
traditional data mining, the network-wide
knowledge discovery is based on the sociological,
anti-positivistic, ideographic and subjective view
of society construed from social facts in contexts
where inter-organizational communication and
information flows are the main source of knowledge
acquisition and analysis.

Socially Construed Reality: A philosophical
view as the construction of social reality is formed
by the concepts of facts, social facts and epistemic
community where the structure of perception is
based on the psychology of thought.

Net-Like Structure: A net is a net-like structure
if and only if it consists of: (i) nodes (or actors),
(ii) connectivity features (links or relationships),
(iii) the environment (the logical context or scope),
(iv) resources (or assets) that can be consumed
or transferred between the nodes, and (v) the
dynamics, which represents the various interactions
and communications between the actors of a net.
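
The five components of a net-like structure can be illustrated with a small data structure. The following is only a minimal sketch: the class name, field types and example values are hypothetical and are not taken from the chapter, and the consistency checks (links, resource holders and interactions must refer to known nodes) are an added assumption.

```python
from dataclasses import dataclass, field


@dataclass
class NetLikeStructure:
    """Hypothetical container for the five components of the definition."""
    nodes: set = field(default_factory=set)        # (i) nodes (actors)
    links: set = field(default_factory=set)        # (ii) pairs of connected nodes
    environment: str = ""                          # (iii) logical context or scope
    resources: dict = field(default_factory=dict)  # (iv) resource name -> holding node
    dynamics: list = field(default_factory=list)   # (v) (sender, receiver, message)

    def is_net_like(self) -> bool:
        """True only if all five components are present and refer to known nodes."""
        has_all = bool(self.nodes and self.links and self.environment
                       and self.resources and self.dynamics)
        links_ok = all(a in self.nodes and b in self.nodes for a, b in self.links)
        holders_ok = all(h in self.nodes for h in self.resources.values())
        actors_ok = all(s in self.nodes and r in self.nodes
                        for s, r, _ in self.dynamics)
        return has_all and links_ok and holders_ok and actors_ok


# Illustrative (invented) tour operating network
tour_net = NetLikeStructure(
    nodes={"ITO", "hotel", "outgoing operator"},
    links={("ITO", "hotel"), ("ITO", "outgoing operator")},
    environment="Lapland packaged tours",
    resources={"rooms": "hotel"},
    dynamics=[("outgoing operator", "ITO", "booking request")],
)
```

Calling `tour_net.is_net_like()` checks the "if and only if" condition in this simplified reading; a structure lacking any of the five components fails it.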

Functional Principles: Contractual principles
(fairness/equality, good faith/fair dealing,
trust/confidentiality) derived from contract law that have
an operative and activating nature in supporting
network governance, knowledge sharing and
information asymmetry reduction.

Information Asymmetry: Refers to a condition
in which at least some relevant information
is (better) known only to some parties (of a net-like
structure) and/or there exists an information
related power balance that typically has a negative
impact on business transactions.
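
The first part of this definition, information known only to some parties, can be made concrete with a simple set comparison. The sketch below is an assumption-laden illustration: the function name, the representation of knowledge as sets of item labels, and the example parties are all hypothetical, and it ignores the "better known" and power-related aspects of the definition.

```python
def asymmetric_items(knowledge):
    """Map each item known to some but not all parties to the parties lacking it.

    `knowledge` is a hypothetical mapping: party name -> set of known items.
    """
    parties = set(knowledge)
    all_items = set().union(*knowledge.values()) if knowledge else set()
    gaps = {}
    for item in all_items:
        lacking = {p for p in parties if item not in knowledge[p]}
        if lacking:  # at least one party does not know this item
            gaps[item] = lacking
    return gaps


# Illustrative (invented) example: only the ITO knows the demand forecast
knowledge = {
    "ITO": {"service specification", "demand forecast"},
    "hotel": {"service specification"},
    "transport": {"service specification"},
}
gaps = asymmetric_items(knowledge)
```

An empty result would mean every item is known to every party; the more items appear in `gaps`, the stronger the asymmetry in this simplified sense.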

Information Intensive Business Governance
(IIBG): An organizational management approach,
like the resource based view (RBV) and the process
based view (PBV), that emphasizes knowledge
and information intensive assets (i.e. the knowledge
based view, KBV) as being the core competencies
and enablers of successful business.


Tourism Value Chain: A supply chain or
network in the tourist industry, which typically
consists of suppliers of basic tourism services, tour
operators (wholesale), travel agents (retail) and
customers (tourists).

Multidisciplinary Concept Modeling: The
creation of multiperspective domain area analyses
in the form of semi-formal concept specifications that
enable the sharing of research conceptualizations
and provide support for explicating the semantics
of the key domain entities and their relationships.


Clinical Data Mining in the Age of Evidence-Based Practice


ABSTRACT


Clinical Data Mining (CDM) is a paradigm of practice-based research that engages practitioners in
analyzing and evaluating routinely recorded material to explore, evaluate and reflect on their practice.
The rationale for, and benefits of, this research methodology are discussed with multiple exemplars from
health and human service settings. While CDM was conceived as a quantitative methodology evaluating
the process, intervention and outcomes of practice, it can support qualitative studies encouraging
reflectiveness. CDM was originally employed as a practice based research (PBR) consultation strategy
with practitioners in clinical settings, but the methodology has been increasingly used by doctoral
students as a dissertation research strategy either by itself or in combination with other research methods.
CDM has gained international recognition by both social workers and allied health professionals. The
authors present CDM as a knowledge-generating paradigm contributing to “evidence-informed” practice
rather than “evidence based practice.”

DOI: 10.4018/978-1-60566-906-9.ch016

Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


INTRODUCTION


In the course of their work, social workers and other
allied health professionals routinely generate and
record massive amounts of qualitative and quantitative
information concerning patient needs, services
provided and outcomes achieved. However, other
than for accountability purposes, this information
is rarely retrieved, converted into data-bases and
systematically analyzed by those practitioners who
have generated it. At the same time, these professionals
are under increasing pressure to integrate
research into their practice and to employ
research-based interventions.

Within social work in the United States and to
a lesser extent elsewhere, Evidence-based Practice
(EBP) is the prevailing paradigm of practice-research
integration (Gambrill, 2006; Gibbs &
Gambrill, 2002) whereby practitioners are encouraged
to conduct exhaustive and critical reviews of
research literature in quest of interventions that are
“proven” to be effective based on the accumulated
evidence of randomized clinical trials (RCT’s)
and meta-analyses of these studies. Alternatively,
EBP proponents advocate providing practitioners
with “manualized” guides to practice, based on the
results of systematic reviews and meta-analyses
conducted by academics. Clearly, this idealized
model of practice-research integration is fashioned
according to western medical practice and drug
studies.
Unfortunately however, there are many rea- 
sons why this approach to social work knowledge 
generation and practice integration is problematic. 
One major reason is that RCT’s are not especially 
suitable for studying social work interventions 
(Epstein, 2001). Another is that EBP advocates 
conceptualize practitioners as mere consumers 
of knowledge, disparaging their accumulated 
“practice wisdom” as “non-scientific” at best and 
“quackery” at worst. In so doing, they minimize 
the potential of practitioners as knowledge pro- 
ducers and further alienate them from the value 
of research. 

This chapter describes an alternative paradigm 
of practice-knowledge generation that engages 
practitioners in evaluating and reflecting on their 
own practice by systematically collecting, analyz- 
ing and interpreting client and patient information 
that practitioners themselves have created. We call 
this analysis of routinely available information 
“Clinical Data-Mining” (CDM) (Epstein, 2001). 
Although CDM was “invented” in the context of 
American social work practice, it has been effec- 
tively disseminated by the authors and applied by 
social work practitioners in Australia, Hong Kong, 
Israel, Singapore and Sweden. In addition, the 
method has been productively employed by allied 
health professionals other than social workers (e.g., 
music therapists, occupational therapists, phys- 
iotherapists, psychologists, podiatrists, speech 
pathologists, etc.) as well as by multi-disciplinary 


teams of health professionals. Finally, in a few 
social work doctoral programs, CDM has been 
accepted as a legitimate research methodology for 
PhD dissertation research either in combination 
with other more established research approaches 
or in its own right. 

Thus, in a decade’s teaching, training and con- 
sultation experience of the authors—together and 
separately—CDM has proven to be an especially 
congenial strategy for engaging practitioners in 
research, for testing research-based knowledge 
as well as practice wisdom, and for producing 
practice-relevant knowledge for social work and 
allied health professions. 

The purpose of this paper is to: 


• Distinguish between EBP as a Research-Based 
Practice (RBP) strategy and 
Practice-Based Research (PBR) 

• Define CDM and identify it as one of a 
number of possible PBR strategies 

• Distinguish CDM from conventional 
data-mining, from Secondary Analysis 
(SA) and from Chart Reviews 

• Present and illustrate a typology of 
CDM approaches 

• Describe the basic steps in the CDM process 
and the methodological variations that 
are possible, offering exemplars of each 

• Discuss CDM’s strengths as well as its 
limitations 

• Discuss the future potential of CDM 


BACKGROUND 


Although the social work research potential of 
available information as well as its limitations 
were clearly articulated decades ago by Shyne 
(1960), academic researchers largely ignore her 
prescient writing on the subject. Emphasizing the 
inadequacies of available agency-based data (e.g., 
missing information, problems of validity and 
reliability, etc.) researchers such as Reamer (1996) 
and Kagle (1998) briefly consider the possibilities 
and quickly dismiss them. Instead, they as well 
as their academic colleagues privilege research 
based on original data collected by researchers, 
for research purposes. Certainly among academic 
researchers associated with the EBP project, RCT’s 
employing standardized quantitative measures are 
viewed as the “gold-standard” to which all social 
work researchers should aspire (Ell, 1996). 

Four decades after the publication of Shyne’s 
paper, in a paper subtitled “Mining for Silver While 
Dreaming of Gold,” Epstein (2001) resurrected 
her method and, working as a research consultant 
in social work agencies, placed it in the hands of 
practitioner-researchers. In that paper, Epstein in- 
troduced and illustrated CDM as a Practice-based 
Research (PBR) strategy that practitioners could 
employ to evaluate and reflect upon their own 
practice using the information they already had 
routinely available to them for practice purposes. 
In this paper, the epistemological assumptions 
contained within PBR were distinguished from 
those within Research-based Practice (RBP). 
The latter are consistent with current thinking in 
the EBP movement which privileges RCT’s with 
original data-collection based on standardized 
quantitative methods and summative evaluations. 

The implication of the subtitle was that by 
contrast with RBP strategies, those which fall 
into the category of PBR might be considered 
imperfect from the standpoint of the reliability, 
validity and completeness of the data but they 
might still employ “gold-standard” logic in 
generating practice-relevant and highly useful 
descriptive and quasi-experimental studies. CDM 
is one such strategy which makes use of available 
clinical information even if it is less than ideal 
from a research perspective. And, it has distinct 
advantages—not the least of which is that it can 
be implemented unobtrusively in clinical research 
studies by practitioners with no additional burden 
on consumers. 

A great deal has changed since Shyne’s day 
which makes the analysis of available information 
by practitioners much more feasible. Most significant 
are the following changes in informatics 
technology: (1) affordable personal computers 
with enormous memory storage and rapid information 
processing capacities; (2) widespread 
computer literacy among non-researchers as well 
as researchers; (3) the introduction of “off the 
shelf” computerized management and clinical 
information systems in social work and other 
public and private service agencies; (4) easy to 
use “point and click” data-analytic software; and 
(5) search engines that make prior research readily 
accessible to practitioners as well as to researchers. 

What hasn’t changed is the fact that for external 
accountability and internal supervisory purposes, 
social workers and other health professionals are 
still obliged to maintain quantitative as well as 
qualitative records of their clients’ demographic 
characteristics, diagnostic information, service 
needs and requests, interventions received, short 
and long-term outcomes achieved and satisfaction 
with service. Taken together, these two sets 
of factors significantly increase the potential for 
the research uses of available information by 
practitioners. 

With this combination of profound technological 
change and continuing production of information 
in mind, CDM may be currently defined 
as a practice-based research strategy by which 
practitioner-researchers systematically retrieve, 
codify, analyze and interpret available qualitative 
and/or quantitative information concerning 
client characteristics and needs, services and 
interventions received and outcomes achieved, 
derived from their own records, for the purpose of 
reflecting upon the practice and policy implications 
of their findings. This definition incorporates 
the latest developments in its application since it 
was identified as a distinct social work research 
method (Epstein, 2010). 

Thus for example, while CDM was originally 
conceived as a purely quantitative approach, under 
special circumstances it can support qualitative 
studies. Similarly, while CDM was originally 
conceived as a retrospective research approach, 
Joubert (2006) has introduced the concept of 
prospective CDM. Additionally, while CDM 
was originally employed as a PBR consultation 
strategy with practitioners in clinical settings, it 
has been increasingly used by doctoral students 
as a dissertation research strategy either by itself 
or in combination with other research methods. 
Finally, while CDM was originally thought of as 
an exclusively social work research method, its 
use has been extended by Joubert and Epstein 
(2005) to multi-disciplinary and allied health 
practice research. Although CDM relies on avail- 
able data, it differs from secondary analysis in 
that the information it employs was not originally 
collected for research purposes. Instead, its unique 
contribution is to maximize the research potential 
of information that would otherwise contribute 
little to knowledge generation via professionals 
who are not routinely viewed as contributors to 
the professional knowledge base. Perhaps more 
important is the way the process of conducting 
CDM studies offers practitioners an opportunity 
for “evidence-informed” reflection on who they 
are serving, what interventions they are providing 
and what outcomes they are achieving. (Figure 1) 


How is CDM Different from 
Conventional Data-Mining, 
Secondary Analysis and Chart 
Reviews? 


Although CDM, conventional business uses of 
data-mining and Secondary Analysis all rely 
on available information, CDM (as it has been 
employed) is different in various ways. Apart 
from its use to answer social work and allied 
health practice-research questions, CDM differs 
from conventional data-mining approaches in that 
it makes use of far less sophisticated statistical 
techniques. In fact, even decision trees, regression 
and cluster analyses, which are the mainstays of 
conventional data-mining in industry (Rexer, 2008), 
are rarely used in CDM. Instead, 
practitioner-initiated and executed CDM studies 
rely heavily on univariate and bi-variate analyses 
in descriptive and quasi-experimental studies. 
When more sophisticated multi-variate analyses 
are employed they are most likely to be carried out 
by more methodologically sophisticated doctoral 
students in their dissertation research studies. 
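
As a rough illustration of the univariate and bi-variate analyses described above, the following Python sketch tabulates a frequency distribution and a cross-tabulation with row percentages. The records, the field names (`referral_source`, `outcome`) and their values are hypothetical and are not drawn from any study cited in this chapter.

```python
from collections import Counter

# Hypothetical, de-identified records extracted from case files.
records = [
    {"referral_source": "self", "outcome": "improved"},
    {"referral_source": "self", "outcome": "unchanged"},
    {"referral_source": "physician", "outcome": "improved"},
    {"referral_source": "physician", "outcome": "improved"},
    {"referral_source": "self", "outcome": "improved"},
]


def frequencies(rows, field):
    """Univariate: count and percentage for each value of one variable."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {value: (n, round(100 * n / total, 1)) for value, n in counts.items()}


def crosstab(rows, row_field, col_field):
    """Bi-variate: percentage of each col_field value within each row_field value."""
    cells = Counter((row[row_field], row[col_field]) for row in rows)
    row_totals = Counter(row[row_field] for row in rows)
    return {
        (r, c): round(100 * n / row_totals[r], 1) for (r, c), n in cells.items()
    }


print(frequencies(records, "outcome"))
print(crosstab(records, "referral_source", "outcome"))
```

With so few cases the percentages are only illustrative; in practice the same tabulations would be run on the full set of mined records, usually in a point-and-click statistical package.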

And while SA is occasionally employed by 
academic social work researchers as well as 
doctoral students, particularly those engaged 
in policy-oriented research, it makes use of 
available information and existing data-bases that 
were generated for research purposes to begin 
with. By contrast, CDM involves the conversion 
to research purposes of clinically-relevant infor- 
mation that was not generated with research in 
mind. Therein lies CDM’s greatest strength and 
greatest weakness. Thus, it salvages and exploits 
informational resources that would otherwise go 
to waste. At the same time—and here it resembles 
conventional data-mining—CDM researchers and 
research consultants must be prepared to struggle 
with issues of missing data, extreme outliers, the 
absence of key variables and limited theoretical 
underpinning. 

Despite these acknowledged limitations, over a 
decade’s experience of CDM consultation suggests 
that it is an extremely effective way to engage 
social work and other human service practitioners 
in research on their own practice. In addition, it 
generates findings as well as theoretical insights 
that can be shared with other professionals through 
publications and conference presentations. In 
social work, this is particularly noteworthy be- 
cause practitioners are notoriously “reluctant” 
to read and/or conduct research studies (Epstein, 
1987). The exemplars described in this paper are 
presented as “evidence” in support of the oppos- 
ing proposition. 

In many respects, CDM is an extension of what 
have traditionally been called “Chart Reviews” 
whereby practitioners review their own records for 
the purpose of aggregating particular information 
about those they are serving. Often this involves 
hand percentaging of information about certain 
categories of service recipients for accountability 
purposes or to identify unmet needs. What makes 
CDM different is its broader scope, its use of 
computerization and data-analytic software and 
its systematic reflection on practice and programmatic 
implications of findings. Rather than seeing 
them as distinct or dichotomous categories, 
perhaps it is more appropriate to think of simple 
Chart Reviews and complex CDM studies as representing 
opposite ends of a continuum reflecting 
the research use of available clinical information. 

Figure 1. The process of generating practice-based evidence through clinical data mining (Macdonald, 
Carroll, Albiston and Epstein 2006) 


PURPOSES OF CDM IN SOCIAL 
WORK AND ALLIED HEALTH 


Although data-mining in other fields is geared to 
building statistical models that predict and validate 
decision-making of one kind or another, CDM as 
the present authors have defined it serves many 
different purposes within social work and other 
allied health professions. These are as follows: 


• To describe consumer characteristics and 
needs, services received and outcomes 
achieved 

• To assess the “fidelity” of implementation 
of intervention models and to evaluate the 
relationship between interventions for various 
groups of consumers 

• To articulate, refine and enhance “practice 
wisdom” regarding various forms of 
intervention 

• To identify “best practices” based on available 
empirical information 

• To encourage practitioner “reflectiveness” 
about what they are doing, with whom and 
with what effect 

• And ultimately, to promote an Evidence-Informed 
Practice (McNeill, 2006; Epstein, 
in press) that is more inclusive than models 
of EBP that rely entirely on original, 
non-clinical data-collection 


THE ADVANTAGES OF USING CDM 
AS A RESEARCH METHODOLOGY 


While it is clear that in every helping profession, 
there is a robust academic research enterprise, 
this program of study tends to emphasize studies 
based on original data, collected prospectively with 
standardized research instruments, preferably in an 
experimental context. Elsewhere, Epstein (2001) 
has referred to this as “Research-based Practice” 
(RBP), a term which encompasses but is not limited 
to the currently dominant EBP model of practice-research 
integration. The academic emphasis has 
the dual negative consequence of failing to exploit 
vast bodies of potentially valuable information 
and furthering the research/practice antagonism 
by imposing research-generated burdens on both 
practitioners and service recipients in clinical settings 
(Epstein, in press). By contrast, legitimating 
CDM as a practice-research methodology opens 
the door to the following: 


• The non-intrusive research utilization 
of enormous quantities of qualitative 
and quantitative consumer and program 
information 

• Relatively inexpensive access to large 
sample sizes and efficient sampling 
possibilities 

• Ease of de-identifying data 

• Efficient data-gathering and low levels of 
study attrition 


With the advent of electronic record keeping in 
some organizations and the introduction of standardized 
assessment tools into clinical information 
systems, the potential for CDM is even greater. 


THE DISADVANTAGES OF CDM AS 
A RESEARCH METHODOLOGY 


It would be misleading to suggest that CDM 
doesn’t have its downsides and detractors. 
Wherever and whenever researchers are limited 
to available information (Shyne, 1960) they in- 
evitably struggle with a host of limitations. Thus, 
CDM tends to be: 


• Labor intensive and dirty, especially 
when working with non-computerized 
records 

• Plagued by missing data, relatively crude 
indicators and/or the absence of any information 
about key study variables 

• Limited in terms of capacity to empirically 
test the validity and reliability of measures 

• Vulnerable to methodological as well as 
organizational problems associated with 
linking multiple data-sets and dealing with 
contradictory and ambiguous information 

• Less likely to receive large research grants 

• Less likely to be published in journals devoted 
to EBP “evidence” hierarchy 


Despite de-identification of data, a few academic 
colleagues have questioned the ethics of 
conducting retrospective research for a purpose 
for which consumers have not granted specific 
consent. Other criticisms have been more 
epistemologically-based. Hence negative labels such as 
“cherry-picking”, “fishing expedition” or “hunting 
trip” have been informally leveled at CDM studies 
implying that they were insufficiently scientific, 
atheoretical, not driven by hypotheses and not 
open to refutation. 

Leaving the thorny question of what is science 
aside, while it is true that many CDM studies are 
exploratory and do not test explicit hypotheses, 
they implicitly test practice and program theories. 
Moreover, we would contend that they are as open 
to discovery (both positive and negative) as any 
other form of research and this observation has 
been supported by virtually every CDM study with 
which the authors have been involved. 


BASIC ELEMENTS OF CDM 


Essentially, CDM is an inductive approach to 
research which as a PBR strategy is intended 
to inform practice decision-making. Hence its 
inspiration is often a practice-related question 
concerning consumer needs and/or service 
outcomes. While the majority of CDM studies 
involve the use of quantitative data directly (e.g. 
demographic data) or the conversion of qualitative 
information into quantitative data (e.g. narrative 
accounts of patient improvements in mental 
health) for subsequent analysis, under the right 
conditions CDM can support qualitative analysis. 
In other words, qualitative information can be analyzed 
as qualitative data (Cordero 2000, Jones 2006, 
O’Callaghan 2001). 

Initially, CDM was conceived by Epstein 
(2001) as an entirely retrospective approach to 
research but Joubert (2006) has since explored 
the possibilities of prospective CDM which will 
be described in a subsequent section of this pa- 
per. Whether prospective or retrospective, CDM 
is at best, quasi-experimental and correlational, 
relying on the logic of RCT’s but avoiding the 
less desirable aspects of implementing RCT’s in 
human service settings. Nonetheless, our experience 
with CDM consultation in agency settings 
suggests that even purely descriptive CDM studies 
can offer valuable insights about client or patient 
needs, service delivery and outcomes achieved. 
Finally, employing Scriven’s distinction (1995), 
while CDM studies are intended to provide formative 
knowledge that will aid in intervention targeting 
and refinement or program development, the more 
methodologically sophisticated CDM studies 
can come fairly close to approximating summative 
findings with sufficient external validity to 
cautiously support external generalization. 
Certainly, by now CDM studies in social work and 
allied health have garnered sufficient legitimacy 
to be published in all but the most scientifically 
“orthodox”, peer-reviewed journals. 


THE PROCESS OF CDM 
IN RESEARCH 


Most CDM studies conducted in practice set- 
tings begin with an expressed desire to do some 
form of unspecified evaluative research. As 
practice-research consultants, we find it important 
to help practitioners specify their interest by 
arriving at a somewhat clearer notion of what 
it is they want to know. Once this is established, 
we follow some fairly predictable steps which 
underscore the utility and appropriateness of the 
“mining” metaphor. And while it is important 
to be somewhat clear about the purpose of the 
initial data-mining, flexibility is important in 
order to be open to unanticipated discoveries. On 
some occasions, these have been quite striking 
but it is safe to say that every CDM study we 
have ever conducted with our practitioner colleagues 
has yielded surprises of one degree or 
another. The reason for that, we suspect, is that 
no matter how perceptive we are as practitioners 
and/or observers we cannot compete with the 
analytic capacity of computers to sift through 
large amounts of data. 

Once the general parameters of the inquiry 
are established, the CDM researcher should take 
the following steps: 

• Prospect all potential sources of 
information 

• Assess core samples for available variables, 
quality and completeness of information, 
accessibility, connectivity of information 
systems, etc. 

• Inventory samples for potential independent, 
dependent, explanatory and mediating 
variables 

• Consult research literature for prior claims 
as well as theories, methods of extraction 
and analysis 

• Create original or adapt available retrieval 
tools 

• Devise a sampling strategy 

• Begin small-scale mining and devise methods 
for testing and improving reliability 
and validity of data that is unearthed 

• Once reliability and validity of mining procedures 
are established, begin large-scale 
mining in accordance with the original 
sampling plan 

• Analyze the findings using simple or complex 
techniques depending on the quality 
of information, the objectives sought and 
the research resources available 

• Utilize the findings within and outside the 
practice context 
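
The "assess core samples" step in the list above can be sketched in a few lines of Python. The sketch assumes a small pilot sample of records already transcribed into dictionaries, and reports the share of non-missing values for each variable so that sparsely recorded variables can be spotted before large-scale mining begins. The field names, the markers treated as missing and the 80% threshold are illustrative assumptions, not part of the CDM method itself.

```python
# Assumed markers for missing information in the transcribed records.
MISSING = {None, "", "unknown"}

# Hypothetical core sample of four charts.
core_sample = [
    {"age": 34, "diagnosis": "ESRD", "living_situation": "alone"},
    {"age": None, "diagnosis": "ESRD", "living_situation": ""},
    {"age": 51, "diagnosis": "diabetes", "living_situation": "unknown"},
    {"age": 47, "diagnosis": "ESRD", "living_situation": "with family"},
]


def completeness(rows):
    """Return the percentage of non-missing values for each variable."""
    fields = sorted({field for row in rows for field in row})
    report = {}
    for field in fields:
        present = sum(1 for row in rows if row.get(field) not in MISSING)
        report[field] = round(100 * present / len(rows), 1)
    return report


def usable_variables(report, threshold=80.0):
    """Variables complete enough, by the assumed threshold, to mine at scale."""
    return [field for field, pct in report.items() if pct >= threshold]


report = completeness(core_sample)
print(report)
print(usable_variables(report))
```

A report of this kind does not decide anything by itself; it simply gives the practitioner-researcher an early, concrete picture of which variables the available records can realistically support.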


These steps are described in great detail in 
Epstein’s “handbook” devoted entirely to CDM. 
And while the steps in the process broadly approximate 
what any applied researcher would 
do, it is important to underscore the “openness” 
of this process to unanticipated findings precisely 
because it begins with what information is available 
rather than with a literature review, a theory 
and sets of specific hypotheses for testing. Hence 
it is more “exploratory” than conventional research 
approaches. This unconventionality may be seen as 
both its greatest strength and its greatest weakness. 
For engaging practitioners who are either fearful 
of research or put off by it, CDM is remarkably 
user-friendly. For academic researchers on the 
other hand, who have a stake in more pristine 
and less messy forms of knowledge development, 
CDM is exceedingly uncomfortable. 

Finally, while the “mining” metaphor remains 
remarkably apposite through the various steps of 
the CDM process, one way that it isn’t involves the 
“raw materials” that are “mined”. Unlike mining 
for precious gems or metals, CDM uncovers and 
refines sensitive information about vulnerable 
human beings. Consequently, at some stage in the 
process, oversight and approval by some external 
ethics or human subjects committee needs to be 
secured so that individuals and groups whose 
information has been analyzed and disseminated 
are adequately protected in terms of anonymity, 
confidentiality, etc. 

In practice, this approval may be sought rela- 
tively early or late in the process, depending on 
how routinely accessible the information is to 
those who are doing the data-mining. Certainly, 
committee approval must be obtained prior to 
external publication or presentation of findings. 
De-identification techniques and aggregation of 
findings are helpful here and all the studies the 
authors have been associated with have achieved 
approval. Moreover, we as research consultants 
and teachers and our practitioner-researcher col- 
leagues as well as doctoral students have been 
personally comfortable that the rights of those 
whose available information is utilized have not 
been abrogated. Nonetheless, it is important to note 
that some academic researchers have objected to 
CDM studies because they do not routinely involve 
consenting research subjects for that particular 
use of their information. 
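
The de-identification and aggregation mentioned above might be sketched as follows in Python. This is a minimal sketch under stated assumptions: the field names, the salt value and the minimum cell size of 5 are hypothetical, and a salted hash yields a pseudonym rather than full anonymity, so it does not replace the judgment of an ethics committee.

```python
import hashlib
from collections import Counter

# Hypothetical secret; in practice it would be stored apart from the research file.
SALT = "replace-with-a-secret-kept-off-the-research-file"


def pseudonym(record_number, salt=SALT):
    """Replace a record number with a salted one-way hash used as a study ID."""
    digest = hashlib.sha256((salt + record_number).encode("utf-8")).hexdigest()
    return digest[:12]


def de_identify(row, direct_identifiers=("name", "record_number")):
    """Drop direct identifiers and attach a pseudonymous study ID."""
    cleaned = {k: v for k, v in row.items() if k not in direct_identifiers}
    cleaned["study_id"] = pseudonym(row["record_number"])
    return cleaned


def aggregate(rows, field, min_cell=5):
    """Count cases per category, suppressing cells smaller than min_cell."""
    counts = Counter(row[field] for row in rows)
    return {k: (n if n >= min_cell else f"<{min_cell}") for k, n in counts.items()}


chart = {"name": "Jane Doe", "record_number": "000123", "service": "counselling"}
print(de_identify(chart))

services = [{"service": "counselling"}] * 6 + [{"service": "housing referral"}] * 2
print(aggregate(services, "service"))
```

Suppressing small cells is one common way of keeping aggregated findings from pointing to identifiable individuals; the appropriate cell size is a local decision.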


EXAMPLES OF DIFFERENT 
TYPES OF CDM STUDIES 


As research consultants to health, mental health 
and child welfare agencies and as social work 
educators, our “hands-on” experience with CDM 
falls into two broad categories—practitioner- 
initiated CDM studies and student-initiated CDM 
studies. The former are generally conducted 
by social work or multi-disciplinary teams of 
social workers and members of the allied health 
professions (Joubert & Epstein 2006). Student 
CDM Master’s and Ph.D. studies are necessarily 
individually conducted and authored. Predictably 
CDM doctoral dissertations tend to be the most 
methodologically sophisticated with regard to 
data-analytic strategies employed as well as the 
mixing of CDM with other research methods and/ 
or the use of available as well as original data. On 
the other hand, practitioner-initiated studies are 
likely to have more of an organizational impact in 
the settings in which they were conducted. Both 
types of studies have resulted in publications 
and conference presentations, often by individuals 
who have never published or presented before. 
Likewise, CDM PhD’s have been a bridge for 
several practitioners to become social work 
academics, bringing their clinical expertise and 
newly acquired research skills to professional 
education. As such, CDM is beginning to play a 
human resource development role. 

Irrespective of whether the CDM is conducted 
in a service agency or educational context, we find 
it most useful for heuristic purposes to present 
CDM studies in three broad categories: 1) needs 
studies; 2) monitoring studies; and 3) outcome 
studies. Naturally, in their implementation, CDM 
studies may overlap these categories, especially 
so because of the importance of exploiting the 
research potential of all available information. 
In CDM studies where we are considering all 
available information as potential data sources, 
needs, services and outcomes may be reflected. 

As indicated earlier, CDM efforts may yield 
relatively simple descriptive studies using univariate 
statistics or straightforward qualitative 
description. Or they may employ bi-variate and 
multi-variate data analyses or complex designs approximating 
RCT studies (Sainz & Epstein, 2001). 
Similarly, qualitative CDM may be approached 
through a complex “constructivist” lens yielding 
results approximating phenomenological studies 
(O’Callaghan, 2006). In the exemplars cited below, 
the potential uses of CDM needs, monitoring and 
outcome explorations are only suggested. 


Need Studies 


CDM need studies tend to describe the demographic 
and psycho-social characteristics of clients, 
patients and/or service recipients in order to 
make inferences about unmet needs that existing 
consumers may bring to the organization. 

Perhaps the simplest quantitative exemplar of 
such a study was conducted by Nilsson (2001) 
who collected information from the charts of 18 
“frequent flyers” in a pediatric diabetes program 
in an Australian children’s hospital. The foregoing 
label was applied by staff to children who 
were frequently readmitted to hospital and whose 
families failed to comply with medical recom- 
mendations. Nilsson was able to identify unmet 
psycho-social needs that these families displayed 
which were extremely helpful in targeting future 
prevention and treatment efforts. 

A mixed-method CDM needs study was conducted 
by MacDonald et al. (2006) of the purely 
social needs of young adults who had experienced 
their first episode of schizophrenia. Extracting and 
analyzing (both quantitatively and qualitatively) 
intake information taken from a self-administered 
questionnaire given to applicants for clinical 
services targeted to young adult schizophrenics, 
they were able to identify the particular kinds of 
social support these young persons wanted from 
family and friends. 

A larger and more complex CDM needs study 
involved the purely quantitative analysis of intake 
information collected in 2 medical clinics in 
New York, designed to identify women at risk of 
intimate partner violence. By retrieving ignored 
questionnaires, converting the information collected 
to an SPSS data-base, aggregating and 
correlating the data, Ross, Walther and Epstein 
(2004) were able to empirically demonstrate the 
links between witnessing prior violence in one’s 
family of origin, fear of intimate partner violence 
and mothers’ concerns about their own potential 
for abusive behavior toward their children. 

Certainly the most ambitious practitioner-initiated 
CDM needs study was conducted under 
the auspices of the Adolescent Health Center at 
Mount Sinai Hospital in New York City. This 
project involved several teams of social workers, 
a psychiatrist, health and occupational counselors, 
etc. Applicants routinely completed a self-administered 
questionnaire named “Adquest” that 
queried them about school, work, family, friends, 
drugs and alcohol, sex and sexual orientation, experience 
of racism, etc. Although the staff routinely 
referred to the questionnaires as “the data-base”, 
they remained untouched and unanalyzed in a file 
drawer once the intake interview was completed. 

While the prior study was clearly the most 
ambitious CDM needs study, perhaps the most 
methodologically sophisticated is a PhD dissertation 
currently close to completion at Hunter College 
School of Social Work where the first author 
teaches. Louis Rodriguez is studying men who are 
“long-term stayers” in homeless shelters that are 
intended to move them into permanent housing 
in New York City. Using survival analysis and 
Cox regression analysis, Rodriguez is seeking to 
derive and validate an empirically-based predictive 
model of length of stay with the intention of 
informing homeless shelter policy and services. 
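
To give a sense of what survival analysis does with length-of-stay records, the following Python sketch computes a Kaplan-Meier estimate, the standard descriptive starting point for such analyses. It is not a reconstruction of the dissertation described above: the durations are invented, the function and variable names are hypothetical, and Cox regression, which would normally be run in a statistical package, is not shown.

```python
def kaplan_meier(durations, exited):
    """Kaplan-Meier estimate of the probability of still being in shelter.

    durations: length of stay in days for each resident
    exited:    True if the stay ended (event observed), False if the resident
               was still in shelter when records were extracted (censored)
    Returns a list of (day, probability) pairs, one per observed exit time.
    """
    survival = 1.0
    curve = []
    exit_days = sorted({d for d, e in zip(durations, exited) if e})
    for day in exit_days:
        # Residents whose stay lasted at least this long are still "at risk".
        at_risk = sum(1 for d in durations if d >= day)
        exits = sum(1 for d, e in zip(durations, exited) if e and d == day)
        survival *= 1 - exits / at_risk
        curve.append((day, round(survival, 3)))
    return curve


# Invented example: six residents, two of them still in shelter (censored).
stays = [30, 45, 45, 90, 120, 200]
exits = [True, True, False, True, False, True]
print(kaplan_meier(stays, exits))
```

The value of the approach for available records is that residents who have not yet left the shelter still contribute information up to the date of extraction instead of being discarded.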

Goldstein (2007) considered a similar length 
of stay for homeless women in New York City 
shelters. Although Goldstein initially employed 
CDM of available information on the women 
and their families supplied by the Department of 
Homeless Services for her dissertation, she did so 
in combination with original qualitative information 
gathered in interviews that she conducted. 


Monitoring Studies 


CDM monitoring studies focus on who is being 
served for what problems and clinical and program 
intervention patterns. As such they may be purely 
descriptive using univariate statistics or qualita- 
tive description to render the reality of practice 
in the agency. Additionally, they may focus on 
the quality of practice by comparing empirical 
patterns of service provision with accountability 
requirements from funding sources, notions of 
“best practice”, and/or the “fidelity” with which 
program models or theories are actually imple- 
mented. Here, bi-variate statistics and correlations 
are employed. Finally, in principle monitoring 
uses of CDM might involve multi-variate analyses 
of institutional racism, sexism or other forms of 
discrimination based on relationships between 
service provision and demographics, controlling 
for problem presentation, diagnosis, etc. 

The following are some practitioner-initiated 
and PhD dissertation exemplars of CDM moni- 
toring studies: 


• Working as a CDM research consultant 
with various levels and kinds of practitioners 
at different health settings in Australia, 
Joubert has facilitated numerous social 
work, allied health and multi-disciplinary 
practice-based research projects. At St. 
Vincent’s Hospital in Melbourne, working 
with a team of administrators in allied 
health, social work, occupational therapy 
and physiotherapy, Joubert and her practitioner 
colleagues conducted a CDM monitoring 
study of Emergency Department 
admissions and discharges and the health issues 
surrounding these (Posenelli, J[…], 
Power, Vale, Lewis & Elliot, 2005). In addition 
to the study findings, their 
paper described the ways in which the 
CDM process fostered managerial collaboration 
in future service planning. 

• In New York City, Dobrof and her colleagues 
(Dobrof, Doinko, Lichiger, Uribarri 
& Epstein 2000) conducted a study of social 
work services provided to End-Stage 
Renal Disease (ESRD) patients receiving 
dialysis treatment at Mt. Sinai Hospital. 
They were able both to describe the problems 
of Israeli ESRD dialysis patients and, 
by comparing findings with Dobrof et al.’s 
study, to reflect upon differences between the 
Israeli and American patient populations 
and the implementation of dialysis services 
in these two countries. In the comparisons, 
previously unanticipated differences were 
surfaced and discussed (Auslander, 2001). 

• A CDM doctoral dissertation conducted
by Hanssen focused in part on the issue
of the “fidelity” of services provided in a
single, highly regarded Intensive Family
Preservation (IFP) agency in the United
States. As a consequence, the two papers
published from this quantitative CDM
dissertation study in the Journal of Family
Preservation can be seen as a CDM
monitoring study (Hanssen & Epstein, 2006)
and a CDM outcome study (Hanssen & Epstein,
2007). Further exemplars of CDM outcome
studies are presented below.


CDM Outcomes Studies 


A final set of CDM studies are those that consider 
the outcomes associated with social work inter- 
ventions. These studies cannot match RCT’s for 
their capacity to demonstrate causality. At best,
they are only approximations to experiments.
Alternatively, they may be qualitative accounts
of “best practices” associated with desirable
outcomes. However, though they are not nearly
as highly valued as RCT’s by those identified
with the EBP movement, they do enjoy several
advantages over prospective controlled
experiments. First, the fact that they are
retrospective ensures that they are unobtrusive
and do not require any change in “natural” modes
of intervention in order to learn from them.
Second, they do not require artificially
constructed control groups and the necessary
withholding of service to derive useful results.
Third, they do not burden service recipients
with the completion of standardized research
instruments for purely research purposes.
Finally, they do not rely on random assignment
of clients, patients or service recipients to
various interventions or to none at all. For
this reason, CDM outcome studies have
been comfortably adopted by teams of clinicians 
working in situ as well as by individual clinicians 
seeking their PhD’s. 

In fact, for the first author of this paper, the 
first published CDM outcome study conducted 
by practitioners has served as a prototype for 
all subsequent quantitative CDM studies of this 
kind (Epstein, Zilberfein & Snyder 1997). Thus,
the Mount Sinai Hospital liver transplant study
involved an extensive, retrospective chart review
by a team of social workers and a psychiatrist 
whose organizational assignment was to assess 
the psycho-social suitability of candidates for 
transplant and to provide on-going social work 
services to those who received transplants (Zil- 
berfein, Hutson, Snyder & Epstein 2001). 

Their study could not “prove” the proposition
that social work services increased survival rates.
As anticipated, Zilberfein and her fellow
data-miners found that demographic factors such as race
and ethnicity were not associated with survival 
rates. However, to their surprise they discovered 
that patients with a documented history of sub- 
stance abuse fared just as well as those who had 
no such history. Intrigued by their unanticipated 
finding, Zilberfein and her colleagues went on 
to demonstrate that those patients whose records 
revealed a history of substance abuse but a failure 
to “link their liver disease to the substances they 
ingest” had significantly lower survival rates 
than those who acknowledged this connection 
(Zilberfein, Hutson, Snyder & Epstein 2001, 102).

A doctoral dissertation conducted by Chan in
a palliative care unit for dying cancer patients in 
Hong Kong employed many of the same steps and 
techniques first developed in the liver transplant 
study (Chan 2007). However, rather than focusing 
on survival, Chan was interested in the psycho- 
social correlates of patients’ having a “good death” 
and the contribution that being in the palliative 
care unit made to this outcome. He was able to 
demonstrate significant T1/T2 improvements on 
several dimensions and to identify what he refers 
to as a “paradox of helping”. Chan suggests that 
this interpersonal dynamic is culturally based 
but there is reason to think that it is universally 
applicable and clinically relevant. 

Working in the context of child welfare in New 
York City, Cordero pioneered a qualitative CDM 
outcome study for her doctoral dissertation. More 
specifically, her focus was on the reunification of 
foster children with their biological families (Cor- 
dero 2000). Working with scrupulously detailed 
social work records in a highly-regarded private 
foster-care agency, Cordero studied only suc- 
cessful reunification cases. Although her sample 
was relatively small (N=18), these represented 
the entire successful output of the agency for the 
year which served as her “window” into successful 
practice. Nonetheless, she was able to employ a
typology based on the court-assigned official
reason for foster placement (i.e., parental
neglect, substance abuse or domestic violence)
and whether the children were in kinship or
non-kinship foster care.

Employing this typology, she extracted quali- 
tative case information following a three-stage 
“model” of intervention that included initial as- 
sessment, maintenance of a positive foster-care 
environment and transition back to the biological 
family. Lessons learned from this study were 
communicated to the field via presentations and 
training conducted by Cordero at the agency from 
which she acquired the data and through publica- 
tion (Cordero 2004, Cordero & Epstein 2005). 


In addition, her work served as the prototype for 
subsequent qualitative CDM studies. 

In another qualitative CDM doctoral
dissertation conducted in Australia, O’Callaghan,
a social worker and a music therapist, data-mines
her own personal practice journals to reflect upon
how music therapy with dying cancer patients
helps and how it does not. Employing a
“self-dialogic” model of enquiry, qualitative
journal entries on over 200 patients and over
350 therapy sessions, and Atlas/ti software,
O’Callaghan describes and illustrates the
cognitive, emotional, spiritual and physical
changes that are reported by patients and [...]
interventions (O’Callaghan 2005, 223).

Another advantage of CDM versus RCT is
demonstrated in the outcomes evaluation portion
of Hanssen’s quantitative CDM study of Intensive
Family Preservation referred to earlier. In her
ex-post facto application of experimental logic
to available data, Hanssen was able to identify
demographic and clinical variations that appeared
to mediate or intensify the effectiveness of the
intervention. Although these await further testing
as hypotheses for future studies, they would not
have arisen in an RCT where the guiding assumption
is ceteris paribus, i.e., all other things being equal.

A final exemplar of a CDM outcome study
involves a doubly mixed-method study. Thus,
Mirabito’s doctoral dissertation (Mirabito 2000)
involved both available and original data-collection
and quantitative as well as qualitative analysis.
It was conducted at Mount Sinai Hospital’s Adolescent
Health Center (AHC), in which the CDM need
studies described earlier (Peake, Epstein &
Medeiros 2004) were conducted. Based upon original
interviews and the qualitative data they generated,
Mirabito constructed a practitioner-based
theory that “unacknowledged” terminations or
“drop-out” without any accompanying clinical
process occurred frequently with those adolescents
who were most resistant to treatment and with
whom treatment was least effective. This theory
was logical, persuasive and consistent with her
expectations.

The quantitative CDM portion of her study 
allowed Mirabito to test this theory with available 
quantitative retrospective data extracted from 
the case records of 100 systematically sampled, 
closed cases selected from the prior year. Most
intriguing to Mirabito were those young persons
who did extremely well in treatment and had very
positive relationships with their therapists but
ended treatment with no acknowledgement. Thus, Mirabito’s
mixed methodology represents a highly creative 
approach to practice-theory development and 
testing. Her findings suggested a more flexible 
treatment termination policy in the agency and 
a more sensitive and differentiated approach to 
young persons with various clinical profiles. 

Joubert (2006) applied CDM prospectively
in a practice-based evaluation conducted by a
multi-disciplinary care coordination team in the
emergency department at Sunshine Hospital, in
the western suburbs of Melbourne, Australia.
Using a prospective data mining methodology, the
team had the opportunity to analyze and evaluate
outcomes relating to patients’ re-presentation
rates in emergency, the number of admissions to
hospital and the length of stay in hospital. The
evaluation process utilized existing assessment
tools for data collection in an attempt to meet the
needs of a busy team who were implementing the
evaluation as part of their practice. The design was
prospective and quasi-experimental, with the team
collecting data over a period of a month and using
CDM to analyze an existing hospital database to
create a comparison group. The outcomes of the
study had an impact on practice and demonstrated the
importance of acknowledging transdisciplinary,
profession-specific skills by means of pathway
protocols for referral between team members.

Joubert (2006) used retrospective and
prospective CDM to explore the relationships between
the process of assessment and discharge planning
in a multidisciplinary team at St Vincent’s health
service. The team explored the role of family
and informal support systems in strengthening
the older person’s quality of life and capacity for
independent living. Both quantitative and
qualitative data from 50 patients recruited into the
study were collected using routine assessment forms.
In phase 1, quantitative data were collected through
the hospital’s patient administrative system and
medical record audit using retrospective data
mining. In phase 2, the allied health team used the
post-discharge semi-structured interview schedule
in a telephone interview with elderly patients and
their carers. The interview schedule was refined
as a result of this analysis to include more items
exploring family resilience and informal social
networks as routine practice in the geriatric
assessment units in the health service.

Although the exemplars cited above all involve
studies of social work interventions or related
interventions to one degree or another, some recent
CDM doctoral dissertations illustrate how the
“mining” of available information can be used
to reflect on broader social work issues at higher
levels of abstraction. Although some might define
these studies as “non-clinical”, their
methodological inspiration and inherent logic were
based upon prior CDM studies.

Thus, for example, in what is essentially an
N=1 study of a single community mental health
organization, Schwartz (2006) used available
quantitative clinical and management data to
reflect on the impact that privatization had on
the agency. More specifically, she demonstrated
that in this agency privatization did not have the
negative organizational impact that the social
work literature on privatization would suggest.

The most recently completed data-mining
dissertation with which the first author of this chapter
was associated is a national study conducted by
Williams-Gray of over 100 organizations that
have gone through an accrediting process.
Viewing the accreditation process as analogous to an
“intervention”, Williams-Gray was interested in
determining whether and how the accreditation
process contributed to organizational
capacity-building. Her findings demonstrated that within
narrow limits (i.e., the development of capacity
to collect and manage computerized information),
it did.

Moving from Schwartz’s N=1 study to
Williams-Gray’s N=256 study to a CDM study of many
thousands, Saracosti (2007) employed a CDM
strategy to study the effectiveness of the first 
year’s implementation of a very broadly focused 
anti-poverty initiative sponsored by the Chilean 
government and targeted to the entire nation. In 
her doctoral dissertation, Saracosti used available 
but previously unanalyzed quantitative data to 
test the broad theory of social capital formation 
that served as an underpinning of this national, 
anti-poverty program. 


THE FUTURE OF CDM IN 
SOCIAL WORK AND ALLIED 
HEALTH STUDIES 


The exemplars cited above provide a picture of the
potential of CDM as a knowledge-generating
strategy in social work and other helping professions
where RCT’s present serious ethical and
professional limitations. Accordingly, our experience in
providing CDM consultation to practitioners who
were indifferent to or hostile to research indicates
that CDM is not only compatible with their practice
norms and values but contributory to their practice
reflection and to their appreciation of the value of 
empirical testing. In addition, it encourages them 
to research and understand research literature in 
contexts in which they are conducting their own 
CDM research studies. 

Additionally, doctoral students—particularly 
those who are concurrently practitioners or iden- 
tified with practice—find CDM an intriguing 
and attractive methodology that allows them to 
efficiently use available agency data rather than 
dealing with the problems of collecting original 
data or introducing intrusive experimental designs 
in natural settings. Finally, the recent introduction
of Propensity Score Matching (citations) to social
work research brings with it the potential for use 
in CDM studies with large quantitative data-bases. 
With this refined statistical tool, CDM studies can 
come closer to approximating RCT’s in a way 
that Sainz & Epstein (2001) crudely adumbrated 
in the first collection of CDM studies. 
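
To make the idea of Propensity Score Matching concrete, the following is a minimal sketch in Python and not a method taken from any of the studies cited here. It assumes a small synthetic data set with two hypothetical covariates (a standardized age and a binary severity flag), estimates each case's probability of having received a service with a hand-written logistic regression, and then pairs each served case with the unserved case whose score is closest. The function names, the caliper of 0.1 and the simulated data are all illustrative assumptions.

```python
import math
import random


def propensity(w, x):
    """Estimated probability of receiving the service given covariates x."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))  # w[-1] is the intercept
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, t, lr=0.5, epochs=500):
    """Fit logistic regression weights by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, ti in zip(X, t):
            err = propensity(w, xi) - ti
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def match(scores, t, caliper=0.1):
    """Greedy 1:1 nearest-neighbour matching without replacement."""
    controls = {i for i, ti in enumerate(t) if ti == 0}
    pairs = []
    for i, ti in enumerate(t):
        if ti != 1 or not controls:
            continue
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)
    return pairs


# Synthetic "available" records (hypothetical): [standardized age, severity flag].
random.seed(0)
X, t = [], []
for _ in range(200):
    age = random.uniform(-1.0, 1.0)
    severity = random.choice([0.0, 1.0])
    p_served = 1.0 / (1.0 + math.exp(-(0.8 * age + 0.5 * severity - 0.3)))
    X.append([age, severity])
    t.append(1 if random.random() < p_served else 0)

w = fit_logistic(X, t)
scores = [propensity(w, x) for x in X]
pairs = match(scores, t)
print(f"{len(pairs)} matched pairs out of {sum(t)} served cases")
```

Outcomes would then be compared only within the matched pairs. Matching of this kind balances the groups on recorded covariates only, so unrecorded differences between served and unserved cases remain a limitation of any such retrospective comparison.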

Unfortunately, however, many academic
researchers and research journals, particularly in
the United States where EBP is currently
dominant, are resistant to CDM as a practice-research
strategy. This resistance represents a real paradigm
conflict that has unfortunate consequences for the 
continued development and dissemination of the 
approach. Thus for example, for three consecutive 
years several submissions for conference presenta- 
tions based on CDM dissertations and on CDM 
as a dissertation “model” have been rejected by 
the main social work research conference in the 
United States. Likewise, the journal connected 
to this conference robustly rejected submissions 
based on one of the CDM dissertations cited 
above on methodological grounds and because 
there had been several prior RCT’s conducted in 
this area. Notwithstanding these prior research 
efforts, submission of the same papers to a journal 
identified with the particular intervention model
that the CDM study explored received immediate 
and enthusiastic acceptance. 

Despite the resistance, there are reasons to be
positive about the future of CDM. The authors are
aware of practitioner-initiated and
academically-supported CDM projects currently underway or
recently completed in Australia, Hong Kong,
Ireland, Israel, New Zealand, Singapore, Sweden
and the United States. More CDM doctoral
dissertations are underway. Another collection of
CDM studies in allied health is in the planning
stage in Australia. And in addition to this chapter,
the first author is currently under contract with a 
major publisher to write a CDM “handbook” for 
a series devoted to otherwise “established” social 
work research methodologies. Publication of the 
CDM handbook and the current chapter should
contribute to legitimating CDM in the United
States and on the world stage. 

Another way to approach this paradigmatic
opposition is to view CDM/RCT or the RBP/PBR
distinction as false dichotomies. Thus Epstein
(2009) has argued that viewing them in con- 
flicting terms has negative consequences for 
practice-research integration. Instead, borrow- 
ing a concept introduced by McNeill (2006), he 
advocates a methodologically pluralist model of 
“evidence-informed practice” which, rather than 
placing different methodologies on a hierarchy, 
accepts the contributions and limitations of every 
knowledge-generating strategy and epistemologi- 
cal paradigm. 


CONCLUSION 


Clinical Data Mining (CDM) is a practice-based 
research (PBR) methodology, different from 
conventional data-mining, Secondary Analysis 
(SA) and Chart Reviews. While CDM studies 
are intended to provide formative knowledge that 
will aid in intervention targeting and refinement 
or program development, the more methodologi- 
cally sophisticated CDM studies can come fairly 
close to approximating summative findings with 
sufficient external validity to cautiously support
external generalization. Although regarded as
less than ideal from an empirical research perspec- 
tive, CDM can be implemented unobtrusively in 
clinical research studies by practitioners, with 
no additional burden on consumers. Originally 
conceived as retrospective analysis, studies using 
a prospective CDM methodology are offering 
the opportunity to monitor careful data collec- 
tion while still focused on routinely available 
information. CDM is rapidly gaining international 
recognition as a PBR methodology by social 
workers and allied health professionals as well 
as being used increasingly in doctoral
dissertations. The authors demonstrate the considerable
contribution made by CDM as a practice-based
research methodology within a methodologically
pluralist model of “evidence-informed practice”.


KEY TERMS AND DEFINITIONS


Research-Based Practice (RBP): The use of 
research-based concepts, theories, designs and 
data-gathering instruments to structure practice so 
that hypotheses concerning cause-effect relation- 
ships may be rigorously tested. 

Practice-Based Research: The use of 
research-inspired principles, designs and informa- 
tion gathering techniques within existing forms 
of practice to answer questions that emerge from
practice in ways that inform practice. 

Clinical Data Mining: Practice-based re- 
search methodology that engages practitioners 
in analyzing and evaluating routinely recorded 
material to explore, evaluate and reflect on their 
practice. 

Formative Knowledge: Collecting data for a 
specific period of time to improve implementation,
to solve unanticipated problems and to record 
whether participants are progressing towards 
desired outcomes. 

Summative Findings: Data that enables a 
judgement to be made about a program’s worth.

Reflective Practice: Involves thoughtfully 
considering one’s own experiences in applying 
knowledge to practice while being coached by
professionals in the discipline. It has been
described as an unstructured self-regulated approach
directing understanding and learning. 


Chapter 17 


Data Mining and the Project 
Management Environment 


Emanuel Camilleri 
Ministry of Finance, Economy and Investment, Malta 


ABSTRACT 


The chapter illustrates how data mining and knowledge management concepts may be applied in a 
project oriented environment for both the private and public sectors. It identifies the project environment 
success roadmap that consists of four levels leading to project corporate success. Processes that control 
the dataflow for generating the projects data warehouse are identified and the projects data warehouse 
contents are defined. The rest of the chapter shows how data mining may be utilised at each project suc- 
cess level to increase the chances of delivering profitable projects that will have the intended impact on 
the corporate business strategy. The general conclusion is that there is a need to structure and prioritise 
information for specific end-user problems and to address a number of organizational issues that may 
facilitate the application of data mining and knowledge management in a project oriented environment. 
Finally, the chapter concludes by identifying the issues that need to be addressed by private and public 
sector organizations so that data mining may be utilised successfully in their decision making process. 


INTRODUCTION 


According to Bala (2008), data mining deals with
the principle of extracting knowledge from large
volumes of data and picking out relevant information
that finds application in various business
decision-making processes. By its very nature the project
oriented environment deals extensively with data,
information and knowledge for a wide spectrum of
decision-making scenarios. This direct and robust
linkage of data mining with a project oriented
environment will be illustrated throughout this chapter
by demonstrating how data mining may be applied
to resolve issues ranging from assessing whether
a proposed project is aligned with the strategic
direction of an entity to the delivery of the project
outputs and outcomes.

Two critical points must be emphasized. Firstly,
no matter what your profession is, whether it is
marketing, engineering, manufacturing or ICT
development, and whether you work for the private
or public sector, you will at one time or another
be involved in undertaking projects. Secondly, 
to keep clients satisfied, private and public sec- 
tor organizations are continually faced with the 
development of products, services, and processes 
with very short time-to-market windows combined 
with the need for cross-functional expertise. In 
this scenario, the application of data mining in a 
project oriented environment becomes a very im- 
portant and powerful tool for those organizations 
that understand its use and have the competencies 
to apply it. 

A project management environment provides 
many challenges. As a project moves through its 
life cycle the issues involved become numerous. 
Some of these issues include managing the project 
portfolio; having a mechanism in place to capture 
and share project lessons learnt; maintaining the 
critical project data flow processes; defining 
project scope; preparing project bids; planning 
and controlling projects; and assessing project 
risk. Hence, the road leading to success in a proj- 
ect oriented environment is a long and difficult 
one. Many of the concerns related to the issues 
highlighted above may be mitigated through the 
application of data mining tools by the thorough 
sifting and analysis of data related to projects 
previously undertaken. 

Private and public sector organizations that are 
involved in delivering projects normally possess a 
tremendous amount of data related to past and cur- 
rent projects. This voluminous historical projects 
data is often by itself of low value. However, its
hidden potential needs to be exploited for various 
purposes within the project life cycle to ensure 
the achievement of the business objectives and 
more specifically corporate success. Executive 
management must seek ways to exploit data to 
add value to processes and create a new reality 
in terms of establishing innovative practices by 
capturing intelligence and knowledge across the
organization. Hence, the project oriented
environment with its extensive data generating capability
and capacity has a direct potential link with data
mining application concepts for private and public 
sector organizations. 

Data mining techniques have been success- 
fully applied to various private sector industries 
in marketing, financial services, and health care.
Governments are using data mining for improving
service delivery, analyzing scientific information,
managing human resources, detecting fraud, and
detecting criminal and terrorist activities.
However, literature is scarce regarding the application
of data mining to a project oriented environment.
Generally, the purpose of this chapter is to show
how data mining concepts may be applied in a
project oriented environment. It will examine the
so-called project success framework and show how
data mining may be utilised at particular stages 
to increase the chances of delivering successful 
projects that will have the intended impact on the 
corporate business strategies of private and public 
sector organizations. 


DATA MINING AND THE 
PROJECT MANAGEMENT 
ENVIRONMENT CONTEXT 


Cooke-Davies (2002) argues that the ultimate aim
of an organization should be to introduce practices
and measures that allow the enterprise to resource
fully a portfolio of projects that is rationally and
dynamically matched to the organization’s business
objectives and corporate strategy. These practices
and measures cover a spectrum of tasks, such as
transforming data to information and information
to knowledge, thus optimizing the information
value chain of an organization and therefore its
ability to bring projects to a successful
conclusion. Sutton (2005) identifies four distinct levels
of project success, with each level having its own
discipline, tools and techniques. Thus, excellence
at each level is viewed as being critical for absolute
project success. The project success framework
put forward by Sutton (2005), shown in Figure 1,
takes a holistic corporate approach by linking
project delivery to corporate strategy. It provides
a road map which leads to an organization being
successful in a project oriented environment. The
objective is to apply data mining techniques as
one travels along the project success road map.

Figure 1. Project success road map - Suttons project

It is important to note that there is a definite
tangible distinction and focus between the four
success levels proposed by Sutton (2005). Project
management success refers to whether a specific
project has produced the desired output (project
deliverables), while project success refers to
whether a specific project has produced the desired
outcomes (project objectives). Hence, project
output and outcomes are viewed as being separate.
Repeatable project management success refers to
the organization’s ability to consistently execute
projects that have produced the desired output.
Furthermore, project corporate success refers
to whether the outcomes produced have the
intended impact on the business strategy of the
organization. Sutton (2005) insists that project
failure may occur at any one of the four levels.
Therefore, managers need to understand where and
how they are failing and then target the measures
that produce the greatest likelihood of success. The
application of data mining techniques is viewed
as providing an opportunity for management to
produce the best likelihood of success at each
of the four project success levels. Therefore, the
objective is to identify how data mining may be
applied at each project success level to facilitate
corporate success.

In addition, an organization’s value chain be- 
comes an important notion when examining the 
application of data mining to the project oriented 
environment. One should note that when referring 
to an organization’s value chain we are in reality 
referring to two separate concurrent but comple- 
mentary value chains. One portrays the physical 
value chain and the other depicts the informational 
value chain. Hence, the physical value chain is 
the transformation of tangible resources, such 
as materials and labour, to a finished product or 
service, while the informational value chain
consists of the data necessary to transform tangible
resources to a finished product or service. Both 
value chains are necessary, each supporting the 
other, and ultimately they shape the basis of the 
organization’s business survival. 

Admittedly, in the knowledge management
literature there is a major difficulty in the use of
consistent vocabulary (Hicks et al., 2006). The
informational value chain in this context is
viewed as similar to the knowledge hierarchy
as defined by Nissen (2000). This researcher
viewed the knowledge hierarchy as the traditional
concept of knowledge transformations, where data
is transformed into information, and information
is transformed into knowledge. This is a rather
simplistic representation of data transformation.
Hicks et al. (2006, 2007) extended the knowledge
hierarchy by adding a new personal knowledge
class (wisdom). Furthermore, Pyle (2003) and
Wong (2004) refer to the knowledge value chain,
where data is viewed as a detailed record of
selected events that is first identified and created,
is summarised and structured into information
for a specific purpose, and is then transformed into
knowledge from information by a structured
framework. Reference to the informational value
chain in this text should be viewed as incorporat- 
ing the notions presented by these researchers. 
Data mining or knowledge discovery refers to 
the process of finding interesting information in 
large repositories of data (Ayre, 2006). There- 
fore, the informational value chain is viewed as 
fundamental to the application of data mining in 
private and public sector entities. 
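
The data, information and knowledge stages described above can be illustrated with a small Python sketch. It is only an assumed example: the task log, its field layout and the 20% tolerance are hypothetical and do not come from the cited authors. Detailed event records (data) are summarised per project for a specific purpose (information), and a simple structured rule is then derived from them (knowledge).

```python
from collections import defaultdict
from statistics import mean

# Data: detailed records of selected events (hypothetical task log):
# (project, task type, planned days, actual days)
task_log = [
    ("P-001", "design", 10, 12),
    ("P-001", "build", 30, 45),
    ("P-002", "design", 8, 8),
    ("P-002", "build", 20, 21),
]


def summarise(log):
    """Information: schedule slippage per project, as a fraction of plan."""
    planned, actual = defaultdict(int), defaultdict(int)
    for project, _task, planned_days, actual_days in log:
        planned[project] += planned_days
        actual[project] += actual_days
    return {p: (actual[p] - planned[p]) / planned[p] for p in planned}


def derive_rule(log, threshold=0.2):
    """Knowledge: task types whose mean slippage exceeds a tolerance."""
    slips = defaultdict(list)
    for _project, task, planned_days, actual_days in log:
        slips[task].append((actual_days - planned_days) / planned_days)
    return sorted(task for task, s in slips.items() if mean(s) > threshold)


information = summarise(task_log)
knowledge = derive_rule(task_log)
print(information)
print("Task types that tend to overrun:", knowledge)
```

In this toy log the per-project summary is the kind of figure a report would show, whereas the derived rule is something a planner could apply to a project that is still in the pipeline.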

Moreover, data mining is the process of 
analysing data from different perspectives and 
summarising it into useful information; infor- 
mation that can be used to increase revenue, cut 
costs, or both (Palace, 1996). Hence, the focus of 
data mining in the project oriented environment 
context is the exploitation and application of the 
organization’s vast repository of projects data to 
the projects that are in the pipeline or are being 
implemented. The aim is to ensure the maximum 
return on project completion with the consequence 
that the undertaken projects will have the intended 
impact on the private and public sector organiza- 
tions’ business strategy. 


APPLICATION OF DATA 
MINING: THE PROJECT 
MANAGEMENT ENVIRONMENT 


Managers are not interested in what data mining
is; rather, they want to know what it will do for
their organization (Pyle, 2003). Data mining is
used to search for valuable information from the
mounds of data collected over time, which could
be used in decision making (Keating, 2008). This
implies that data mining permits private and
public sector users to analyze large databases to
solve business decision concerns with the aim of
increasing revenue and/or decreasing costs. 

Additionally, a project oriented environment
incorporates an organization’s informational value
chain to provide timely and complex analysis of
an integrated view of data to strengthen the
organization’s competitive position. This section will
address two essential aspects. The first aspect is
related to the contents of the data warehouse and
the organizational processes that contribute to
populating it. The second aspect is the application
of data mining methods as a project travels along
the four project success levels.


THE HEART OF THE MATTER: 
PROCESSES AND THE PROJECTS 
DATA WAREHOUSE 


Management have six types of resources at their
disposal to carry out the projects under their
responsibility. These are money, people, materials,
equipment, energy and data. The focus of
this section is data and the processes needed to
support the data flow. Datta (2008) makes reference
to the basic elements of data mining, two
of which are: (a) extracting, transforming, and
loading transaction data onto the data warehouse
system; and (b) storing and managing the data in a
multidimensional database system. However, for
these elements to occur management must have
the proper processes in place. These processes
permit the communication and dissemination of
information and knowledge to the relevant people,
thus achieving the three remaining data mining
elements, namely, data access by relevant
professionals; analysis of data by suitable application
software; and presenting results in a meaningful
format to various organizational users.

In an environment where projects are con- 
ducted by individuals in isolation, the processes 
will most likely be undemanding and involve only 
a few persons. However, in project oriented
organizational environments the processes that determine
the information flow can be quite intricate. Figure 
2 provides a concise view of the complexity of 
the functions and processes that control project 
information flow in a private or public sector 
project oriented environment. Each functional area 
may generate a combination of data, information 
and knowledge that are required to be stored in 
a projects data warehouse for retrieval, analysis 
and compilation of meaningful reports to resolve 
complex problems. Figure 2 shows the strong 
integration of the various functional areas that 
contribute to the physical and informational value 
chains. The informational value chain consists of 
data from external and internal sources that com- 
bine to provide a holistic and complete picture of 
the organizational project oriented environment 
at any one point in time. 

Figure 3 illustrates that ICT plays a crucial 
role in bringing together the processes and data 
to populate the projects data warehouse that may 
be mined to determine operational patterns and 
resolve specific concerns for private and public 
sector organizations. Figure 3 demonstrates a 
number of fundamental features. Firstly, input
data may consist of raw data that act as the input
transactions for Management Information Systems
(MIS), which generate the transactional databases,
and/or may consist of documented project
experiences, such as business strategies; contracts and
project scopes; various concerns and solutions;
and various conflicts and conflict resolutions, that
are entered directly into the projects data
warehouse without any MIS filtering.

Secondly, MIS provides information for the 
projects data warehouse and may also utilise its 
transactional databases as an input source for
Decision Support Systems (DSS) and Executive
Information Systems (EIS). Thirdly, DSS and EIS 
may after executing the relevant business models 
provide information and knowledge to the projects 
data warehouse. Finally, the projects data ware- 
house will consist of data, information, and 
knowledge that will be used by data mining 
methods for the resolution of a wide spectrum of 
project related concerns. The long term objectives 
are to reconcile the varying views of data; provide 
a consolidated view of enterprise data; create a 
central point for accessing and sharing analytical 
data; and develop an enterprise approach to busi- 
ness intelligence and reporting. 

This concept is in line with the five tier
knowledge management hierarchy of Hicks et al.
(2007), where the data warehouse is populated
from various sources, including individual
experience; databases; learning systems; DSS and EIS;
knowledge pooling; best practices; expert systems;
and corporate strategy. It is important to note that
the processes needed to support the data flow and 
the respective critical data sets that are generated 
by them are essential to the four project success 
levels. Finally, the concept is applicable to both 
the private and public sectors, irrespective of the 
industry or government department they represent. 


THE ROAD TO SUCCESS: 
PROJECT MANAGEMENT 
SUCCESS (OUTPUTS) 


Sutton’s (2005) project success Levels 1 and 2
refer to the project management function. Success
Level 1 refers to the successful completion of an
individual project, whereas Success Level 2
refers to repeatable project management success,
that is, the organization’s ability to consistently
execute projects that have produced the desired
deliverables. The emphasis of Success Level 1
is on the tasks that achieve project scope;
project planning and control; and project risk
management. The focus of Success Level 2, by
contrast, is having a definite project management
standard and ensuring compliance with it throughout
the organization. Both success levels are related
to project planning, control and execution. The
literature revealed a number of potential areas
where data mining may be applied
to project management success. These potential
data mining application areas are discussed below.

Figure 2. Processes controlling project information flow and data warehouse contents


Project Proposal Preparation 
and Project Scope 


Project proposal and project scope are
interconnected. A project proposal is an initial definition
of the project outputs and outcomes, and precedes
the project scope. Project scope defines in detail
what needs to be done and what is excluded from
the project. A project scope is only undertaken
when a project proposal has been accepted by the
client and is usually an annex of a formal legal
contract between the client and the organization
executing the project.

If a project proposal is issued as part of a
competitive tender process then timeliness, quality and
accuracy of the bid preparation become critical.
However, at the project scope stage quality and
accuracy become the most important elements.
It should be emphasised that the project proposal
and project scope establish the overall project
time and cost parameters, and normally also
define the payment terms and payment schedule.
Therefore, an organization must overcome three
major hurdles: to outclass its competitors by
promptly providing a precise and accurate project
bid; to be awarded the project; and to execute the
project within the defined project scope and the
established contractual parameters to achieve its
estimated profit margin.

Figure 3. Project data warehouse & data mining

Nemati and Barko (2002) argue that we are 
living in an age where information is quickly 
becoming the differentiator between industry 
leading firms and second rate organizations. The 
application of data mining at this level would focus
on the analysis of the projects data warehouse to
find similar project requirement configurations for
projects that have already been undertaken by the
organization. This may be achieved through text
mining and rules generation using classification
and association of the projects data warehouse
(previous project scopes and contracts). A software
tool may be used to automatically summarise text
data and extract valuable rules that may be further
transformed into a semantic network which may
provide a concise and accurate summary of the
analysed text. Nayak and Qiu (2004) have
successfully applied this data mining technique to
analyse software problem reports in pure text for
the accurate prediction of time and cost in fixing
software problems at a global telecommunication
company.
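
As a minimal sketch of how similar past project
scopes might be retrieved through text mining, the
following Python fragment ranks previous scope
documents against a new requirement statement using
TF-IDF weighting and cosine similarity. This is a
generic text similarity technique, not the semantic
network method of Nayak and Qiu (2004); the function
names and the sample scope texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed inverse document frequency: rare terms weigh more.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar_scopes(new_scope, past_scopes):
    """Return (index, similarity) pairs, most similar past scope first."""
    vectors = tfidf_vectors(list(past_scopes) + [new_scope])
    query = vectors[-1]
    scores = [(i, cosine(query, vec)) for i, vec in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical scope summaries held in the projects data warehouse.
past_scopes = [
    "Construct two lane road bridge with steel girders",
    "Develop payroll software application for ministry",
    "Resurface road and repair bridge deck",
]
ranking = rank_similar_scopes("Steel road bridge construction", past_scopes)
```

In this toy example the first past scope shares the
most distinctive terms with the new bid and is ranked
first, while the unrelated software project scores zero.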

This data mining application goes beyond the 
concept of exploring patterns and relationships 
within the projects data warehouse to discover 
hidden knowledge; it would aim at enhancing the
decision-making process by transforming data
and information into actionable knowledge and
gaining a strategic competitive advantage. The
application of data mining tools for the project 
proposal preparation and project scope would 
allow the project management team to prepare 
project bids and project scopes quickly, accurately 
and at a lower cost before the competitors, and 
to be aware of particular concerns related to a 
specific project type. Moreover, this data mining 
application would mean that project bids may be 
submitted to a higher level of quality before com- 
petitors, thus conveying a positive image for the 
organization to potential clients with a resultant 
increase in good will. 


Accurate Estimation of Time and 
Cost to Project Completion 


Traditionally, project management success and
failure are seen as being dependent on the accurate
estimation of the time and cost of the works to 
be completed and ensuring that works execution 
does not exceed these estimates. Thus, to deliver 
a project on time and within budget requires the 
application of best project management practices 
and tight control of the projects undertaken. 

An essential step within the project planning 
stage is the accurate preparation of activity util- 
ity data. The preparation of activity utility data is 
concerned with estimating the duration and cost 
that each activity within a project will take to be 
completed. Furthermore, an individual activity 
may be conducted by alternative methods using 
different types and combination of resources. 
Hence, the duration and cost activity estimate 
will need to be established for each alternative 
method. These calculations become important 
during project execution particularly when a 
project slips behind schedule and certain critical 
activities need to be expedited. However, a large 
and complex project may consist of hundreds of 
activities, with many activities having different 
execution methods. Therefore, the preparation 
of activity utility data for a particular project 
becomes a mammoth task both in terms of work
effort and cost; and is open to the risk of errors 
and inaccuracies. 

Hence, at the project planning stage, data mining
may be applied to the preparation of utility
data for current project activities by analysing
the projects data warehouse and using cluster
analysis to identify similar activities that were
conducted in previous projects, and extracting the
related estimate data and alternative methods of
executing the planned activities of the current
project. This data mining application may be
particularly beneficial to construction industry
projects, where the resultant analysis may provide
a combination of ways of executing particular
activities utilising different equipment, crew sizes,
and working hours. Admittedly, the resultant activity
estimates and alternative methods would still
need to be reviewed, but the overall effort and cost
to conduct this essential planning task would be
significantly decreased.
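
The cluster analysis step can be illustrated with a
small sketch. It assumes, purely for illustration, that
each historical activity is described by two numeric
features (work quantity and crew size) together with its
recorded duration, and it uses a basic k-means routine
with hand-picked starting centroids. The feature choice,
the data and the function names are hypothetical; real
activity records would need feature scaling and a
properly chosen number of clusters.

```python
import math
from statistics import mean


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=20):
    """Basic k-means starting from the supplied initial centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(
                tuple(mean(axis) for axis in zip(*members)) if members else old
            )
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def estimate_duration(new_activity, history, centroids, labels):
    """Average recorded duration of past activities in the matching cluster."""
    cluster = nearest(new_activity, centroids)
    durations = [d for (_, d), lab in zip(history, labels) if lab == cluster]
    return mean(durations)


# Hypothetical history: ((quantity in m3, crew size), duration in days).
history = [
    ((10, 2), 3.0), ((12, 2), 3.5), ((11, 3), 3.0),
    ((50, 6), 9.0), ((55, 6), 10.0), ((52, 7), 9.5),
]
points = [features for features, _ in history]
centroids, labels = kmeans(points, [points[0], points[-1]])
estimate = estimate_duration((48, 6), history, centroids, labels)
```

The estimate obtained this way is only a starting point
and, as noted above, would still need to be reviewed by
the planner.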

Data mining also becomes extremely
beneficial at the project implementation stage in
situations where critical project activities are close
to (or are) running behind schedule or when
non-critical activities are approaching being critical.
In this situation data mining may be utilised to
analyse similar activities from the projects data
warehouse of current and previous projects, and
suggest alternate methods to carry out the specific
activity to recover lost time at an optimal cost.
The overall objective in the application of data
mining at this level is to ensure that the optimum
economic project solution is being implemented
with a change in project circumstances.

The application of data mining methods for 
the project planning and control stage embraces a 
number of different approaches. Iranmanesh and 
Mokhtari (2008) contend that traditional methods 
to deal with the complex task of controlling and 
modifying the baseline project schedule during 
project execution to measure and communicate 
the real physical progress of a project are not 
adequate, since these methods often fail to
predict the total duration of a project to completion.
These researchers have applied the decision tree,
neural network and association rule data mining
tools to predict the total project duration in terms
of Time Estimate at Completion. To calculate the
Time Estimate at Completion, the Iranmanesh
and Mokhtari (2008) model applies six input
parameters, namely, actual cost of work performed;
budgeted cost of work performed; budgeted cost of
work scheduled; actual duration; earned duration;
and planned duration.

The three data mining methods provided con- 
sistent results in that the neural network showed 
that the cost performance index (budgeted cost of
work performed divided by actual cost of work 
performed) had the largest weighting among all 
indexes to predict project completion time; whilst 
the decision tree and the association rule methods 
predicted consistent Time Estimate at Comple- 
tion results. The objective of the study was to 
enable the applied data mining tools to accurately 
forecast the project completion time during the 
project execution stage so that the project team 
may assess and monitor project risk by measuring
project progress in time and monetary terms and 
take proactive preventative actions to mitigate 
any adverse conditions. 
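
To make the input parameters concrete, the sketch below
computes the cost performance index as defined above,
together with the conventional schedule performance
index (budgeted cost of work performed divided by
budgeted cost of work scheduled). The duration
projection shown is a simple rule-of-thumb extrapolation
(planned duration scaled by the ratio of actual to earned
duration) and is an assumption made for illustration; it
is not the decision tree, neural network or association
rule model of Iranmanesh and Mokhtari (2008). All
function names and figures are hypothetical.

```python
def cost_performance_index(bcwp, acwp):
    """Budgeted cost of work performed divided by actual cost of work performed."""
    return bcwp / acwp


def schedule_performance_index(bcwp, bcws):
    """Budgeted cost of work performed divided by budgeted cost of work scheduled."""
    return bcwp / bcws


def projected_duration(planned_duration, earned_duration, actual_duration):
    """Naive projection: assume the current rate of progress continues."""
    return planned_duration * actual_duration / earned_duration


# Hypothetical status of a project after 10 weeks of a 20 week plan.
cpi = cost_performance_index(bcwp=400.0, acwp=500.0)
spi = schedule_performance_index(bcwp=400.0, bcws=500.0)
weeks_at_completion = projected_duration(
    planned_duration=20.0, earned_duration=8.0, actual_duration=10.0
)
```

An index below 1 signals that the project is over budget
or behind schedule; here the naive projection extends
the 20 week plan to 25 weeks.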


Occupational Health and Safety 


Many projects involving the output of physical
items, such as in the engineering and construction
environments, regularly encounter occupational
health and safety issues. The consequences of
accidents during project execution may turn out to be
very harmful both in terms of human casualties
and project cost escalation. For example, the West
Gate Bridge in Melbourne, Australia collapsed
during construction in 1970. Approximately 2000
tonnes of steel and concrete came crashing down,
taking the lives of thirty-five workers with many
others injured. The report from the Royal
Commission (VPRS 2591/P0, unit 14) stated: “Error
begat error ... and the events which led to the
disaster moved with the inevitability of a Greek
Tragedy.” The project was finally completed after
ten years at a cost of A$202 million. While project
cost escalation issues may somehow be resolved
in the long-term, human lives are irreplaceable.

A data mining method that may be applied to
reduce such tragic incidents would be very cost
effective in terms of human casualties and project
expenditure. The NASA Engineering and Safety
Center (NESC) was established to improve safety
through engineering excellence within NASA
programs and projects (Parsons, 2007). One of
NESC’s objectives is to find methods that enable 
it to become proactive in identifying areas that 
may be precursors to future problems. Parsons 
(2007) argues that problems are better prevented 
than solved. Hence, the goal is to find a method to 
uncover adverse patterns. Parsons (2007) contends 
that NASA’s research findings indicate that clus- 
tering techniques in their particular environment 
are a key component. 

However, caution is needed because there is a disparity
between the generation of data and the true
interpretation (or understanding) of the meaning within
the data. The findings suggest that when data is
dynamic, voluminous, noisy, and incomplete,
learning algorithms are the least effective
and discovery algorithms such as clustering are
optimal. Furthermore, when the data mining
objective is exploration, clustering should be used
as the optimal unsupervised learning technique
(Parsons, 2007).

Hence, in a project oriented environment, data
mining tools such as learning and discovery
algorithms may be used to determine which project
activities, skills and/or resources may be more
prone to occupational health and safety issues
so that appropriate steps are taken to mitigate or
prevent adverse occurrences. Furthermore,
decision trees and association rules may be used to
detect anomalies in the way project activities are
being carried out in relation to past projects and
current regulatory standards (e.g. engineering,
construction, and occupational health and safety
standards). The data mining methods applicable
will depend on the organizational environment
in terms of the data, information and knowledge
characteristics, such as quality, volume, integrity
and completeness.


Preventative Maintenance 
of Plant and Equipment 


Many project oriented organizations, particularly 
those involving engineering and construction, 
increasingly rely on profits generated from the 
high utilisation of plant and equipment. The 
unscheduled disruption in the use of plant and 
equipment during project execution not only in- 
curs direct costs of labour, replacement parts and 
consumables, but also the consequential costs of 
delays to contract, possible loss of client goodwill 
and ultimately, loss of profit. The findings of a 
study conducted by Barber et al. (2000) regard- 
ing the cost of quality failures in two major road 
projects suggest that the cost of failures may be 
a significant percentage of total costs, and that 
conventional means of identifying them may not 
be reliable. Moreover, these types of costs will not 
be easy to eradicate without widespread changes 
in attitudes and norms of behaviour within the 
industry and improved managerial co-ordination 
of activities throughout the supply chain.

Srinivas and Harding (2008) propose a data
mining integrated architecture model that provides 
a mechanism for continuous learning and may 
be applied to resolve concerns regarding process 
planning and scheduling, including extracting 
knowledge to establish rules for identifying 
maintenance interventions. Wang (2007) illus- 
trates the use of data mining to solve a scheduled 
maintenance problem in a manufacturing shop 
which may also be applicable to a project en- 
vironment. Wang’s data mining application has 
two objectives: classification - to determine what 
subsystems or components are most responsible 
for downtime, the “root cause”; and prediction - to 
forecast when preventative maintenance would be 
most effective in reducing failures. Finally, the
generated information may be used to establish
maintenance policy guidelines, such as a planned
plant and equipment maintenance schedule. In
this example classification and prediction were
achieved by utilising decision trees.

Wang (2007) applied the decision tree approach
to classify machine health, with equipment
availability being the target dependent variable. The
developed model identified the plant and
equipment most responsible for
low equipment availability. Therefore, the
aim was to detect the plant and equipment with
a low availability index, thus focusing on specific
apparatus (or group of apparatus) in order to
make the maintenance effort more effective, thus
saving time and cost. The generated nodes on the
decision tree consisted of the different plant and
equipment that are classified by the evaluation
of the equipment availability value. Hence, those
responsible for maintenance are able to examine
specific plant and equipment responsible for the
low availability in this part of the classification
and take the necessary action. Furthermore, the
model is able to provide accurate knowledge about
the specific component that is the “root cause” of
failure within the indicated plant and equipment.
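
The idea of an availability index can be sketched
without a full decision tree. The fragment below is a
deliberately simplified stand-in for the approach
attributed to Wang (2007): it assumes availability is
uptime divided by the sum of uptime and downtime,
flags equipment that falls below a chosen threshold, and
reports the component with the most downtime as a
candidate “root cause”. The threshold, the record layout
and all names are hypothetical.

```python
from collections import defaultdict


def availability(uptime_hours, downtime_hours):
    """Share of total time in which the equipment was available."""
    return uptime_hours / (uptime_hours + downtime_hours)


def low_availability_report(downtime_records, uptime, threshold=0.9):
    """List (equipment, availability, worst component), lowest availability first.

    downtime_records: iterable of (equipment, component, downtime_hours).
    uptime: mapping of equipment name to uptime hours.
    """
    downtime = defaultdict(float)
    by_component = defaultdict(lambda: defaultdict(float))
    for equipment, component, hours in downtime_records:
        downtime[equipment] += hours
        by_component[equipment][component] += hours

    report = []
    for equipment, up in uptime.items():
        index = availability(up, downtime[equipment])
        if index < threshold:
            components = by_component[equipment]
            worst = max(components, key=components.get)
            report.append((equipment, round(index, 3), worst))
    return sorted(report, key=lambda row: row[1])


# Hypothetical maintenance log for one reporting period.
uptime = {"excavator": 160, "crane": 190, "paver": 120}
downtime_records = [
    ("excavator", "hydraulics", 30), ("excavator", "engine", 10),
    ("crane", "hoist", 10),
    ("paver", "screed", 20), ("paver", "conveyor", 60),
]
report = low_availability_report(downtime_records, uptime)
```

With these figures the paver and the excavator fall
below the threshold, while the crane does not appear in
the report.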


Project Risk Management 


According to Hubbard (2009, p. 46), risk
management is the identification, assessment, and
prioritization of risks followed by coordinated
and economical application of resources to
minimize, monitor, and control the probability
and/or impact of unfortunate events. There are
many causes of negative risks in project
execution, including delays in the delivery of adequate
supplies; inadequate quality levels of procured
items; high turnover of project team members;
and a host of other potential adverse elements.
These risk sources can be damaging to a project,
for example by causing delays in project delivery dates
and budget overruns. The consequences of these
risk occurrences include financial loss;
demoralisation of project team members; and harm to
the reputation of the project manager. Project risk
management endeavours to foresee and deal with
uncertainties that jeopardise the objectives, and
the time and cost schedules of a project.

The basis for the project risk management 
process is information and knowledge. Earl (2001) 
argues that knowledge is a critical organizational 
resource and mentions examples from industry 
about how various organizations build and utilise 
their knowledge base. For example, he refers to 
BP Amoco’s philosophy of productivity through 
knowledge reuse and accelerated learning which 
is articulated by their expression: “Every time 
we drill another well, we do the next one better.” 
Earl (2001) illustrates how a typical
productivity-through-knowledge project at BP follows a number
of stages, including: documenting the current work
process; gathering, summarising and codifying
knowledge and expertise on critical tasks; and
conducting post project reviews to assess initial
goals, examine what actually happened, and assess
the variance between outcome and intent. Hence,
both the positive and negative aspects of executed 
projects are documented for future utilisation. 

The above process ensures that new learning 
and experience is added and validated by the proj- 
ect team and expert facilitators. This way a projects 
data warehouse is maintained with knowledge 
and expertise that has the potential, if suitably 
applied, to identify and quantify risk so that an 
appropriate risk response may be undertaken by 
the project manager. 

Datta (2008) identifies risk analysis as a key 
data mining application area where hidden rules 
may not be obvious to the decision maker. Data 
mining is extremely useful in facilitating project 
risk management. For instance, risk identification 
basically addresses the question: What might go 
wrong? The aim of this process is to identify and
specifically name the project risks and their
characteristics. Data mining can be applied through the
analysis of the projects data warehouse, seeking
to learn classification and association rules to
determine the attributes of the potential
identified risks. The analysis would closely examine
the current project plan for areas of uncertainty 
when compared to projects that have already been 
implemented. Hence, the objective of the analysis 
would be to examine the project plans to search 
for issues that could cause the project to be behind 
schedule. The outcome of the analysis would be 
a full risk inventory that would categorise risk 
under a number of major headings consisting of 
two components, namely, the likely cause of the 
specific condition, for example, a sub-contractor 
not meeting the delivery schedule; and the general 
impact of the risk on the project, for example,
milestones will not be achieved and/or the budget 
will be exceeded. 
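
The notion of an association rule linking a likely cause
to its impact can be illustrated with the two standard
rule measures, support and confidence. The sketch
assumes, hypothetically, that each completed project in
the data warehouse is summarised as a set of recorded
events; the event labels and function names are
illustrative only.

```python
def support(projects, items):
    """Fraction of past projects in which all the given events occurred."""
    items = set(items)
    return sum(items <= events for events in projects) / len(projects)


def confidence(projects, cause, impact):
    """Estimated probability of the impact given that the cause occurred."""
    cause_support = support(projects, cause)
    if cause_support == 0:
        return 0.0
    return support(projects, set(cause) | set(impact)) / cause_support


# Hypothetical event sets recorded for five completed projects.
projects = [
    {"subcontractor_late", "milestone_missed"},
    {"subcontractor_late", "milestone_missed", "budget_exceeded"},
    {"subcontractor_late"},
    {"staff_turnover", "budget_exceeded"},
    {"no_issues"},
]
rule_support = support(projects, {"subcontractor_late", "milestone_missed"})
rule_confidence = confidence(
    projects, {"subcontractor_late"}, {"milestone_missed"}
)
```

Here the rule “late subcontractor delivery leads to a
missed milestone” holds in two of the three projects
where the cause occurred, which is the kind of evidence
a risk inventory entry could cite.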

When the risk identification process is com- 
pleted, the project manager will closely interpret 
the analysis containing the resultant risk inventory 
and decide on a risk by risk basis which project 
risks are to be further investigated through risk 
quantification. The risk quantification process 
would result in a prioritised list of project risk
elements that will need a response from the
project manager to take advantage of the risk if it has
a positive trait or to take action to mitigate any
adverse circumstances should the risk have a
negative attribute. Turner and Zizzamia (2008) apply
a similar approach using data mining predictive
modelling for an insurance claims management
scenario. They maintain that predictive
modelling provides a better understanding of a claim,
allowing an appropriate and immediate response
to be identified and prioritised. Their predictive model
has the potential to analyse hundreds of risk
attributes based on the available data to produce a
numerical score (and rationale) indicating
exposure level and complexity. Turner and Zizzamia
(2008) argue that by using the predictive model
to explore and pinpoint exposure, risk managers 
can optimise resource deployment and minimise 
the process duration. 


In a project management scenario, decision 
tree analysis provides a way of presenting a bal- 
anced view of the risks and pay-outs associated 
with each possible alternative strategy. This type 
of application has the objective of answering the 
following questions: What is the probability of 
meeting the project scope, taking into account all 
known and quantified risks? By how much will 
the project be delayed? What level of contingency 
does the organization need to allocate in terms of 
time and cost to meet the desired level of certainty 
taking into consideration the predicted project 
delay? Where in the project are the most risks, 
taking into consideration the project network and 
all the identified and quantified risks? 
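
A balanced view of risks and pay-outs is commonly
expressed as an expected value per alternative
strategy. The sketch below assumes that each strategy
has been reduced to a list of (probability, pay-out)
outcomes; the strategies and figures are invented for
illustration, and the probabilities would in practice
come from the risk quantification described above.

```python
def expected_value(outcomes):
    """Probability-weighted pay-out of one strategy.

    outcomes: list of (probability, payout) pairs whose probabilities sum to 1.
    """
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * payout for p, payout in outcomes)


def best_strategy(strategies):
    """Name of the strategy with the highest expected pay-out."""
    return max(strategies, key=lambda name: expected_value(strategies[name]))


# Hypothetical alternatives for recovering a delayed project.
strategies = {
    "expedite with overtime": [(0.7, 120_000), (0.3, -40_000)],
    "keep current plan": [(0.5, 100_000), (0.5, 20_000)],
}
values = {name: expected_value(o) for name, o in strategies.items()}
choice = best_strategy(strategies)
```

Expected value alone ignores the spread of outcomes, so
a risk-averse organization may still prefer the
alternative with the smaller downside.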

Using decision tree analysis as a data mining 
tool is valuable because it visibly defines the 
decision to be resolved by showing all options 
and associated cost calculations; permits manage- 
ment to fully assess all the likely consequences 
of a decision; provides a feasible framework for 
calculating the outcome values and respective 
probabilities of achieving them; and helps
management to evaluate the available information to
arrive at the best decisions by selecting the better 
alternative. Statistical methods can also be used to 
assess the impact of all identified and quantified 
risks. The outcome of the statistical analysis is a 
probability distribution of the project’s cost and 
completion date based on project risks to predict 
schedule risk. Schedule risk is the probability 
that a project will go beyond its calculated time 
schedule and cost. 
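
One common way of obtaining such a probability
distribution is Monte Carlo simulation. The sketch below
is a simplified illustration rather than a method taken
from the cited literature: it assumes the activities run
strictly in series, that each duration follows a
triangular distribution defined by optimistic, most
likely and pessimistic estimates, and that the figures
shown are hypothetical.

```python
import random


def simulate_total_durations(activities, runs=5000, seed=1):
    """Sample the total duration of activities executed in series.

    Each activity is an (optimistic, most_likely, pessimistic) tuple in days.
    """
    rng = random.Random(seed)
    return [
        sum(rng.triangular(low, high, mode) for low, mode, high in activities)
        for _ in range(runs)
    ]


def probability_of_overrun(totals, deadline):
    """Share of simulated runs that finish after the deadline."""
    return sum(total > deadline for total in totals) / len(totals)


# Hypothetical three-point estimates for three sequential activities.
activities = [(8, 10, 15), (4, 5, 9), (10, 12, 20)]
totals = simulate_total_durations(activities)
schedule_risk = probability_of_overrun(totals, deadline=30)
```

The resulting value of schedule_risk estimates the
probability of exceeding a 30 day deadline under these
assumptions; a real project network with parallel paths
would need a critical path calculation in each run.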

The data mining application may also provide 
the possible risk response based upon the proj- 
ect risk identification and quantification, thus 
itemising the options available and defining the 
appropriate actions to enhance the opportunities 
and minimise the threats. The aim of the project
manager at this stage would be to closely examine
the data mining results and select the best
approach to address each risk that merits attention,
and propose particular actions for implementing
the selected risk policy. Furthermore, the project
risk data mining application should be viewed
as being a continuous process that will regularly 
monitor and control risk during the entire proj- 
ect implementation life cycle. The continuous 
monitoring and control of risks will identify any 
change in the risk status or if a particular risk 
has developed into an issue. The inventory of 
project risk is not static and constantly changes 
as the project is being implemented, thus new 
risks evolve and other risks disappear. Hence,
the data mining risk reviews allow the project 
manager to reassess and modify the risk ratings 
and prioritisation throughout the project lifecycle 
until its successful completion. 


Repeatable Project 
Management Success 


As stated previously, repeatable project
management success is the organization’s ability to
consistently execute projects that have produced 
the desired output. The emphasis here is on “con- 
sistency” in implementing projects successfully. 
Consistency in a project management context is 
normally achieved by having and adhering to a 
uniform project management standard through- 
out the organization. Therefore, the objective is 
to ensure that the stages and steps in the project 
implementation life cycle do not deviate from the 
project management standard by detecting anoma- 
lies in the way the project is being implemented. 
This is particularly applicable to projects that have
a specific implementation framework, for instance,
computer application software development and
conducting research and development projects.
According to Eberle and Holder (2007), detecting
anomalies in various data sets is an important
endeavour in data mining, particularly for handling
data that cannot be easily analyzed. Anomaly
detection in data mining is related to the discovery of
events that generally do not conform to expected
normal behaviour. Such events are often referred
to as anomalies, outliers, exceptions, deviations,
and other similar designations depending on the
application domain. Although deviations may
be infrequent events, their occurrence may have 
serious consequences and therefore their detec- 
tion becomes extremely important. Most anomaly 
detection methods use a supervised approach that 
requires some sort of baseline of information from 
which comparisons or training can be performed 
(Eberle & Holder 2007). There are generally two 
steps in anomaly detection schemes: 


• Building a profile of the “normal” behaviour.
  These profiles may be patterns or summary
  statistics of the overall population.

• Using the “normal” profile to detect anomalies.
  Anomalies are observations whose
  characteristics differ significantly from the
  normal profile.
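
The two steps above can be sketched with summary
statistics as the “normal” profile. The example assumes,
hypothetically, that the baseline consists of the
durations of a review stage in past compliant projects
and that an observation is anomalous when it lies more
than three standard deviations from the baseline mean;
both the data and the threshold are illustrative choices.

```python
from statistics import mean, stdev


def build_profile(baseline):
    """Step 1: summarise normal behaviour as (mean, standard deviation)."""
    return mean(baseline), stdev(baseline)


def detect_anomalies(observations, profile, threshold=3.0):
    """Step 2: return observations that deviate strongly from the profile."""
    centre, spread = profile
    return [x for x in observations if abs(x - centre) > threshold * spread]


# Hypothetical durations (days) of a design review stage in past projects.
baseline = [5, 6, 5, 7, 6, 5, 6, 6]
profile = build_profile(baseline)
anomalies = detect_anomalies([6, 5, 14, 7], profile)
```

Only the 14 day review is flagged here; the other
observations fall within the range implied by the
baseline profile.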


In a project management standard compliance 
application, anomaly detection can be based on 
supervised learning whose goal is to develop a 
group of decision rules that can be used to deter- 
mine a known outcome. For instance, the project 
management standard would form the basis of 
the data mining learning model, defining classes 
and providing positive and negative examples 
of objects belonging to the classes. Supervised 
learning algorithms can be utilised to construct 
decision trees or rule sets that work by repeat- 
edly subdividing the data into groups based on 
identified predictor variables which are related to 
the selected group membership. The supervised 
learning algorithms, such as classification, create a
series of decision rules that can be used to separate
data into specific determined groups. 

Typically, the major difficulty in detecting 
anomalies in a data mining context is to know 
what is “normal”. For instance, Eberle and Holder
(2007) assume that an anomaly is not random and 
that an anomaly should only be a minor deviation 
from the normal pattern. They argue that anyone 
who is attempting to hide devious activities would 
not want to be caught, and therefore, they would 
want their activities to look as real as possible. 


This does not appear to be a concern in a project 
management standard compliance application, 
since the “normal” is established by the project 
management standard being used. Another con- 
cern is having noisy data that may hamper efforts 
to detect deviations. However, the type of data 
being generated in a project oriented environment
will be mostly filtered and therefore clean, hence 
noisy data should not present an obstacle for this 
type of application. 

Furthermore, the project management standard 
may be codified or labelled through a standard 
work breakdown structure (WBS) framework that 
would be mirrored by the WBS milestones for the 
projects being implemented. Hence, the detection 
of any deviation from the project management 
standard by a specific project is a practical application
that may be easily achieved by the above data
mining method. This would ensure that projects 
are implemented to the desired quality at a cost 
effective level; and that the organization’s ability 
to consistently execute projects that have produced 
the desired output is realised, thus achieving the 
primary objective of repeatable project manage- 
ment success. 
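
As a simplified illustration of checking a project against a codified standard, the sketch below compares a project's WBS milestones with a standard WBS and reports the deviations. It is a plain rule-based comparison rather than the supervised learning approach described above, and the WBS codes, milestone names and function name are hypothetical.

```python
def wbs_deviations(standard_wbs, project_wbs):
    """Compare a project's WBS milestones with the standard WBS.

    Returns the milestones required by the standard but absent from the
    project, and the project milestones the standard does not define.
    """
    standard = set(standard_wbs)
    project = set(project_wbs)
    return {
        "missing": sorted(standard - project),
        "unexpected": sorted(project - standard),
    }


# Hypothetical standard WBS and one project's recorded milestones.
STANDARD_WBS = ["1.1 Initiation", "1.2 Planning", "1.3 Execution", "1.4 Closure"]
project_a = ["1.1 Initiation", "1.3 Execution", "1.4 Closure", "9.9 Ad hoc rework"]

report = wbs_deviations(STANDARD_WBS, project_a)
print(report)
```

Here the project skipped the planning milestone and recorded one milestone that the standard does not define; both would be candidates for review.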


THE ROAD TO SUCCESS: PROJECT 
SUCCESS (OUTCOMES) 


This is Sutton’s (2005) project success Level 3. 
Project management is often viewed as the appli- 
cation of knowledge, competencies, methods, and 
tools to achieve the defined project tasks in order to 
satisfy stakeholder requirements and expectations 
from a project. This view takes into consideration 
two aspects, namely, project outputs, that is, the 
actual deliverables; and project outcomes, that is,
the project purpose and objectives. The previous 
section addressed how data mining tools may aid 
in attaining the project outputs; this section will
focus on how data mining may aid in achieving
the project outcomes.


Active Project Stakeholders and
the Perception of Project Success 


A project’s level of success will depend
on the perceptions of the different
stakeholders that have an interest in the project.
Therefore, a critical consideration is whether or 
not the project achieves its purpose and objectives, 
that is, does the project do what it is supposed to 
do? The answer to this question is very subjective 
because it depends on the eyes of the beholder, 
namely the different stakeholders’ perceptions. 
For example, in a building construction project, 
the outcomes are closely related to the users of 
the building. Is the building functional for the pur- 
pose it was built? Does it accommodate different 
individuals’ general needs? For instance, does the 
building design cater for individuals with special 
needs? Hence, project outcomes are more difficult 
to achieve because they take into consideration 
the operational aspects of the deliverables after 
the project has been implemented. 

The difficulty with achieving project success 
as distinct from project management success is the 
variety of the stakeholders that need to be satisfied. 
These stakeholders may include consumer groups, 
environmentalists, local communities, general 
public, mass media, shareholders, creditors and 
many others depending on the nature of the proj- 
ect. Hence, each industry type may have different 
active stakeholders. However, it should be noted 
that individual entities normally conduct projects 
that are specific to their industry. Therefore, these 
individual entities are in a position to identify and 
know their influential and active stakeholders. 

According to Rennolls and Shawabkeh (2008),
knowledge of various forms is recognized as 
a crucial business asset, to be utilized for the 
development of new products and services, and 
hopefully leading to competitive advantage. They 
argue that knowledge management has been high 
on corporate agendas, with the main concerns 
(apart from IT infrastructure) being people and 
culture, and communication and collaboration.
Hence, gaining knowledge of the stakeholders’
needs, characteristics, and attitudes is achievable
and is fundamental in influencing their perceptions.
This may be translated into the application of
data mining methods to establish a mechanism for
building a knowledge warehouse and developing
a learning organization, and utilising it for the
purpose of facilitating the achievement of the project
outcomes. This process may consist of capturing,
storing, analysing and sharing project lessons
learnt about past project outcomes and profiling
stakeholders’ needs, attitudes and characteristics.


Building a Knowledge 
Warehouse and Developing 
a Learning Organization 


The generation of a knowledge warehouse and 
the development of a learning organization re- 
quire continuous attention and effort. The major 
reasons for this are: (a) knowledge may be ob- 
tained from both the organization’s internal and 
external environments, therefore knowledge is 
infinite; and (b) knowledge must be relevant to
the organization’s needs, however deciding what
is relevant may not be a straightforward matter.
Hence, perfection is not entirely possible. Having
said this however, even the imperfect achievement 
of a knowledge warehouse and the creation of 
a learning organization will have a tremendous 
positive effect on the project success performance 
rating. Currently, the maturation of data mining
supporting processes which would take into account
human and organizational aspects is still
in its infancy (Pechenizkiy et al., 2008).
There are a number of activities that may help
an organization to build and retain knowledge and
thus develop a learning organization. For example,
knowledge about project outcomes may be gained
by collecting and storing remarks (and their
source) appearing in the mass media, such as the
press, internet, virtual media, and televised and
broadcast media about projects that are relevant
to the particular industry in which the organization
is involved. These projects may include
project proposals or projects being undertaken 
within a similar cultural and operational envi- 
ronment, and not necessarily conducted by the 
organization itself. 

Academic literature suggests that a learning 
organization knows how to retain knowledge, 
appreciates the value of sharing collective knowl- 
edge, and grows more knowledgeable with each 
activity it performs (Day and Rogers, 2006). The 
aim is to build a knowledge warehouse about the 
outcomes of projects, and the identification and 
profiling of influential stakeholders. The knowl- 
edge source in this case will be mainly external 
and could come from anywhere and at any time. 
The type of knowledge to be collected will vary 
but the content of critical reviews and their source 
are obviously most relevant. This will enable 
the organization to gain knowledge about what 
society thinks about specific projects (likely out- 
comes) and identify the influential stakeholders. 
However, in generating a knowledge warehouse 
it is important to examine ethical considerations, 
particularly data protection legislation in relation 
to the creation of stakeholder profiles. It is emphasised
that references to influential stakeholders
should be understood as generic designations,
not as individuals.

Another means of collecting knowledge is 
through the maintenance of an electronic journal, 
documenting specific unique experiences during 
project implementation. This knowledge will be
mainly from internal sources; however, external
information sources are also possible, particularly 
from contractors and sub-contractors that are 
involved in the project. This knowledge source 
will involve projects specifically undertaken by 
the organization. Special focus should be given 
to factors related to satisfying project outcomes 
and specifically to remarks and other feedback 
from clients, potential end users of the project 
deliverables and society in general. 

Finally, the creation of a learning organization
is facilitated by conducting a post project implementation
review, specifically when the project
outputs have shifted to the operation stage, where
project outcomes are brought to fruition. The post
project implementation review evaluates the project
at completion to assess what went right and
[...] learnt through the documentation of the good and
bad things in the management of the project and
capture all comments and recommendations for
[...] project reviews should occur after major events
and milestones because data collected close to
the event eliminates the bias of hindsight. Such
a process facilitates a commitment to long-term
relationships amongst the project teams and
stakeholders with the primary objective of having
continuous improvement through learning from
project experience.

This data mining mechanism ensures that 
knowledge gained by individuals is retained for 
the benefit of the organization. The lack of such a
mechanism will mean that knowledge is likely
to be lost, especially if the individual leaves
the organization. When employees
leave an organization, they carry with them
invaluable tacit knowledge which is often the 
source of competitive advantage for the business 
(Nagadevara et al., 2008). Knowledge lost to the 
organization is likely to be knowledge gained by 
a competitor. Therefore, data mining in a project
management environment has the potential to
allow the storage, retrieval and analysis of
project experience and knowledge that is shared
throughout the organization for the achievement
of the defined project outcomes and not hoarded
by any particular individual.


Utilisation of the Knowledge 
Warehouse 


The terminology of machine learning and data
mining methods does not always allow a simple
match between practical problems and methods:
some problems look similar from the user’s
point of view but require different methods to be
solved, while others look very different yet
can be solved by applying the same methods and
tools (Van Someren and Urbancic, 2006). Applying
the appropriate data mining tools for problem
solving in practice depends on experience and 
an innovative approach in the way a knowledge 
warehouse is utilised. 

The focus in the utilisation of the knowledge 
warehouse to achieve project success is to share 
project lessons learnt about the outcomes from 
previous projects and stakeholder profiling. The 
aim is to predict stakeholder reactions to a proj- 
ect, taking a proactive approach to mitigate any 
adverse stakeholder reactions to the project, thus 
influencing the eventual project outcomes. According
to Datta (2008), data mining can help
in identifying and exploring patterns of information
from massive client focused databases and
can help to select, explore and model large amounts
of data to discover previously unknown patterns,
for the advantage of business. This application
would be similar to a marketing environment 
related to launching a new product or service, 
where predicted reactions to the project by differ- 
ent stakeholders are viewed as the likely project 
outcomes, and the stakeholders are associated to 
different types of clients, with each stakeholder 
type having different requirements, attitudes and 
attributes. 

The objective of this data mining application 
would be to identify the various stakeholders that 
are likely to have an interest in a proposed proj- 
ect; to identify the likely attitude to a proposed 
project by the identified stakeholders; to ascertain 
the characteristics of each stakeholder type; to 
itemise decisions taken in prior projects and their 
respective impact on stakeholder attitudes and 
project outcomes; and to provide suggestions that 
are likely to have a positive impact in changing 
stakeholder attitudes and therefore are likely to 
influence project outcomes. Stakeholder segmentation
analysis would be an appropriate method
in this situation.

The stakeholder segmentation analysis would 
aim at identifying groups of stakeholders that have 
common attributes and that have an interest in a 
proposed project. The stakeholder attributes are
likely to represent attitudes and resultant behaviours.
This stakeholder segmentation analysis
may be based on supervised learning algorithms
that take the form of a hierarchical decision tree
structure such that small segments form larger
stakeholder segments, with each small segment
representing a stakeholder type. Furthermore,
using the knowledge about the attributes of each 
stakeholder type and the decisions taken in former 
projects and their respective impact on stakeholder 
attitudes and project outcomes, this data mining 
application may create a series of decision rules 
that may be used to generate ideas that are likely 
to positively change stakeholder attitudes to the 
proposed project for each stakeholder segment 
with the aim of favourably impacting the outcomes 
of the proposed project. 


FINAL DESTINATION: PROJECT 
CORPORATE SUCCESS 


Project corporate success is Sutton’s (2005) project 
success level 4. The consequence of business en- 
deavours that do not support the business strategy 
is the misuse and under utilisation of corporate 
resources. Hence, it is essential that projects are 
aligned with the organization’s strategic direction 
and that their completion results in a positive 
impact on the organization’s business objectives 
(Cleland and Ireland, 2006). Applying data mining 
methods in a project oriented environment can 
facilitate corporate success. In a practical sense,
this means using data mining techniques to sustain
organizational initiatives by having a proper
project selection process and best practice in
project portfolio management.


Proper Project Selection:
Strategic Fit of Projects 


According to Cleland and Ireland (2006), projects
are essential to the survival and growth of organizations.
They argue that failure in the management of
projects in an organization will impair the ability 
of the organization to accomplish its mission in 
an effective and efficient manner. Data mining 
may be used to determine if a project proposal 
is aligned with the corporate business strategy 
before a decision is made about whether to pursue 
it. Similar to the other data mining examples, this 
application is based on having and utilising the 
relevant information stored in the projects data 
warehouse (refer to Figure 3). A combination 
of decision tree learning and statistical methods 
may be used to construct a predictive model. The 
aim of this application is to conduct an analysis 
of the project proposal and the relevant informa- 
tion items held in the projects data warehouse to 
determine a strategic fit index for particular project 
proposals. The strategic fit index would be based 
on assessing the following: 


a. The extent to which a proposed project fits 
within the organization’s activity boundary. 

b. Human and financial resource implications 
of the proposed project to ensure that it does 
not expose the organization to the risk of 
economic non-sustainability. 

c. Stakeholder values and expectations related
to the proposed project to ensure that unrealistic
project execution time frames, which
often lead to ineffective outcomes, do not
occur.

d. Long-term influence of the proposed project
on the organization to ensure that undertak- 
ing a major project does not restrict the 
organization from conducting other concur- 
rent projects and hence adversely impact the 
organization’s potential for future growth. 


This data mining application will evaluate the 
strategic fit of a proposed project and will also 
rank proposals in order of priority, since a high 
strategic fit index means a higher project ranking. 
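
One way to picture the ranking step is a weighted score over the four assessment criteria (a) to (d). The sketch below is an assumption-laden stand-in for the predictive model described above: the chapter does not give a formula, so the equal weights, the 0 to 1 scoring scale, the criterion names and the sample proposals are all hypothetical.

```python
# Assumed equal weights for criteria (a) to (d); not taken from the chapter.
WEIGHTS = {
    "activity_fit": 0.25,              # (a) fit within the activity boundary
    "resource_sustainability": 0.25,   # (b) human and financial resources
    "stakeholder_expectations": 0.25,  # (c) stakeholder values and expectations
    "long_term_influence": 0.25,       # (d) long-term influence on the organization
}


def strategic_fit_index(scores, weights=WEIGHTS):
    """Weighted sum of criterion scores, each assumed to lie in [0, 1]."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank_proposals(proposals):
    """Return proposal names ordered from highest to lowest strategic fit."""
    return sorted(
        proposals,
        key=lambda name: strategic_fit_index(proposals[name]),
        reverse=True,
    )


# Hypothetical proposals with hand-assigned criterion scores.
proposals = {
    "Proposal A": {
        "activity_fit": 0.5,
        "resource_sustainability": 0.5,
        "stakeholder_expectations": 0.5,
        "long_term_influence": 0.5,
    },
    "Proposal B": {
        "activity_fit": 1.0,
        "resource_sustainability": 0.5,
        "stakeholder_expectations": 1.0,
        "long_term_influence": 0.5,
    },
}

ranking = rank_proposals(proposals)
print(ranking)
```

In practice the criterion scores would come from the model built on the projects data warehouse rather than being assigned by hand.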


Project Portfolio Management: 
Project Partnerships and 
Analysis of Project Bids 


Undertaking a major project can be viewed as a 
partnership between the project owner (client), the
organization executing the project, and suppliers. 
The failure of one partner could be detrimental to 
the project and is likely to result in a financial loss 
to some or all of the partners. The extent of harm 
to the project will depend on the failing partner. 
For example, if the client fails, the project will 
likely be abandoned. However, if a supplier fails,
the project will probably experience a delay with a
resultant financial loss, but the project as a whole 
will likely survive. Project portfolio management 
in this text will take the view of the organization 
executing the project and consider a data mining 
application for determining the financial reliability 
of the client and contractors in the supply chain. 
Van Someren and Urbancic (2006) cite an 
example that uses data mining for predicting fi- 
nancial risk in the banking industry by evaluating 
credit worthiness to forecast the financial state of 
a person, company or other entity by exploring the 
characteristics of their current financial state and 
economic conditions. This example is based on a 
Bayesian model using information about similar 
clients and contractors whose status is known to 
establish a comparable appraisal baseline. The 
input to the model is a mixture of numerical 
and nominal data that is normally available in 
financial statements. Furthermore, Hensher and
Jones (2007), using published financial data, apply
a mixed logit model (or random parameter logit
[...] the development of more powerful and accurate
forecasting methodologies to predict corporate
bankruptcy is of importance to a range of user
groups, including shareholders, creditors, employees,
suppliers, ratings agencies, auditors and
corporate managers.

A similar approach may be used to assess the 
financial reliability of the project partners, in 
this case, the client and contractors in the sup- 
ply chain. However, it should be noted that the 
generated knowledge from the prediction model 
will need to be presented in a manner that is eas- 
ily understood by the decision makers. There is 
no doubt that these data mining applications will 
enable the organization to understand and assess 
the financial reliability of potential partners and 
enable private and public sector organizations 
to choose their partners carefully and thus avoid
partial or complete project failure.
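
To make the idea of a Bayesian appraisal baseline more concrete, the following is a minimal naive Bayes sketch over nominal attributes with add-one (Laplace) smoothing. It is not the model cited above: the attribute names, the category values and the five historical records are invented for illustration, and numerical financial data would first have to be binned into categories.

```python
from collections import Counter, defaultdict


def train(records):
    """Count labels and attribute values from (features, label) pairs."""
    label_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
    attribute_values = defaultdict(set)  # attribute -> distinct values seen
    for features, label in records:
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            attribute_values[name].add(value)
    return label_counts, value_counts, attribute_values


def predict(model, features):
    """Return the normalised posterior probability of each label."""
    label_counts, value_counts, attribute_values = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior probability of the label
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Add-one smoothing avoids zero probabilities for unseen values.
            score *= (seen + 1) / (count + len(attribute_values[name]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Hypothetical partners whose financial status is already known.
history = [
    ({"liquidity": "high", "payment_record": "on_time"}, "reliable"),
    ({"liquidity": "high", "payment_record": "on_time"}, "reliable"),
    ({"liquidity": "low", "payment_record": "on_time"}, "reliable"),
    ({"liquidity": "low", "payment_record": "late"}, "unreliable"),
    ({"liquidity": "low", "payment_record": "late"}, "unreliable"),
]

model = train(history)
posterior = predict(model, {"liquidity": "low", "payment_record": "late"})
print(posterior)
```

The posterior is a pair of probabilities rather than a bare verdict, which is one simple way of presenting the result in a form that decision makers can weigh.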


FUTURE TRENDS 


The applications described above are based on 
the ability to have a well synchronised team 
utilising the extensive knowledge that an enterprise
possesses. To engage in innovative project
management practices such as supply chain
management and the sharing of information and
knowledge across the entire organization requires
the acceptance of a collaborative spirit. Computer-supported
cooperative work (CSCW) systems are
computer-based tools that support collaborative 
activities that meet the requirements of normal 
collaborative efforts among people (Zhu, 2006). 

Future research should attempt to link data 
mining tools to CSCW. The future widespread
use of data mining lies in the ability and capacity
of an organization to implement an enterprise
knowledge framework that permits individuals to 
collaborate in the gathering, storing, analysing 
and sharing of data, information and knowledge 
across the organizational boundaries, be they 
private or public entities. An enterprise approach
to knowledge management will increase an organization’s
capacity to apply data mining tools for
strengthening its competitive position and
ensuring corporate success.


CONCLUSION 


This chapter has examined data mining and the
project management environment. It has shown
a number of applications where data mining and
knowledge learning may be used at various stages
in the project management lifecycle with the ultimate
goal being to achieve corporate success in
private and public sectors. Data collected on its
own is not of much use and has to be converted
into information and knowledge so that it can be
used (Datta, 2008). Data mining is an analytical
process specifically designed to explore the
considerable magnitude of data from different
perspectives in search of consistent patterns and
relationships between variables and to summarise
the findings into useful information and knowledge
that can help an entity to increase its revenue and/or
decrease its costs.

In the context of enterprise resource planning,
data mining involves the search for patterns from
statistical and logical analysis of large transaction
data sets that can help in decision-making (Mons
& Wagner, 2005). A proper project management
environment integrates an organization’s information
value chain with its decision making process
to increase its ability for making effective decisions 
in the implementation of projects. 

On the other hand, there is a growing gap
between more powerful storage and retrieval
systems and the users’ ability to effectively
analyse and act on the information they contain.
Both relational and on-line analytic processing
(OLAP) technologies have immense potential
for navigating voluminous data warehouses.
However, there is a need to structure and prioritise
information for specific end-user problems and
to address a number of organizational issues that
may facilitate the application of data mining and
knowledge management in a project oriented
environment. Some of the major organizational
issues that are applicable to both the private and
public sectors include:


a. Ensuring that data mining applications focus
on and support the strategic direction of the
organization to gain a competitive advantage
and ensure clients’ satisfaction;

b. Recognising that data, information and 
knowledge are a corporate asset that should 
be proactively managed like every other 
major asset; 

c. Respecting ethical values by ensuring that 
individuals are not the end target of profiling 
exercises since this may be in conflict with 
data protection legislation; 

d. Ensuring the support of executive man- 
agement for sharing data, information 
and knowledge across the organizational 
boundaries; 

e. Recognising that data mining applications do 
not follow the conventional data processing 
way of thinking but require an innovative 
and creative mindset; 

f. Recognising that information and particu- 
larly knowledge are dynamic and therefore 
must be constantly rejuvenated through 
continuous regeneration; 

g. Recognising that the contents of a data
warehouse depend on well defined data
flow procedures and processes;

h. Ensuring that appropriate security measures
and procedures are in place to protect the data
warehouse from unauthorised access and/or
deliberate and non-deliberate destruction;

i. Ensuring that an appropriate organization
structure is in place for knowledge management
and its associated data mining
functions;

j. Selecting suitable analytical software tools 
that are compatible with the existing ICT in- 
frastructure and the projects’ data warehouse. 


Finally, it is essential to have a senior manage- 
ment executive who will act as the organization’s 
data mining champion to guarantee the long-term 
sustainability of the data mining investment. These 
measures will ensure that using data mining in a 
project oriented environment will help an entity to
achieve corporate success at unprecedented levels.


KEY TERMS AND DEFINITIONS


Data Mining or Knowledge Discovery: The 
process of analyzing data from different perspec- 
tives and summarizing it into useful information. 


Data Warehouse: A repository of an entity’s 
electronically stored data designed to facilitate the 
transformation and loading of data for retrieval,
analysis, and decision making purposes. 

Informational Value Chain: The data neces- 
sary to transform tangible resources to a finished 
product or service. 

Physical Value Chain: The transformation of 
tangible resources, such as materials and labor,
to a finished product or service. 

Project: A finite (temporary) piece of work 
that has a beginning and an end. 

Project Corporate Success: Whether the 
project outcomes produced have the intended 
impact on the business strategy of the organization.

Project Management: The application of
knowledge, competencies, methods, and tools
to achieve the defined project tasks in order to
satisfy stakeholder requirements and expectations
from a project.

Project Management Success: Whether a 
particular project has produced the desired project 
deliverables (outputs). 

Project Success: Whether a particular proj- 
ect has produced the desired project objectives 
(outcomes). 

Repeatable Project Management Success: 
The organization’s ability to consistently execute 
projects that have produced the desired outputs.


Knowledge Discovery in
Networked Environment 


Rauno Kuusisto


ABSTRACT


Collaboration and networking demands are increasing, and many organizational communicative activities
have moved into technical networks. The need to understand not only how to refine the right information
content out of the available mass of data, but also what type of information is important in various
information-using situations, has increased. This chapter delves into the problem of finding ways
to support users in finding relevant, specific types of information related to the various phases of
operating in a network. Establishing a network, planning operations and managing operations differ from
each other in their information requirements. It will be shown via four generalized cases that
information requirements vary depending on the phase of networking activity the organization is in. Via
those cases, which are based on sufficiently broad empirical material, it will be made clear that knowledge
requirements differ from one situation to another. This leads to the conclusion that flexible data mining and
knowledge discovery systems should be constructed.


INTRODUCTION 


The increasing amount of available data
and information has been a powerful engine for
research into data mining and knowledge discovery.
Methodologies and procedures for sorting relevant
and reliable information out of the vast, ever
evolving and increasing mass of data have been
successfully developed. In addition, a great number
of solutions that help to discover relevant key
words or key expressions exist. However, those
solutions are mainly targeted at marketing
development purposes, not at networking purposes.
For networking purposes, many social media tools
and other more sophisticated collaboration
solutions exist, but they do not answer the
challenge of comprehensively finding the right type
of information. Rather, they support people in
finding other people who are interested in the same
kinds of areas and items, leaving information
discovery as the responsibility of the users. So, the
question “How to do it?” is frequently expressed and
answered in the case of knowledge discovery. A less
studied area is what kind or type of information
should be discovered for certain information-using
situations. Such situations exist, for example, in
networked business environments and inter-authority
collaboration situations. The question “What to
do?”, which is relevant in such situations, is
expressed more seldom under the topic of knowledge
discovery.

The purpose of this chapter is to introduce a
less frequently expressed perspective on knowledge
discovery. This chapter describes an example
of a high-level ontology to solve challenges faced
when developing algorithms for networking in an
emergent and evolving communication environment.
Algorithms are not introduced. The focal
point is to introduce the differences in information
requirements between the various phases of
collaboration situations. Via those differences it will be
demonstrated that knowledge discovery requirements
also vary from one situation to another. Information
is dealt with not at the content level but at the
framework level. This allows general phenomena of
inter-working situations to be found, thus making it
possible to develop general knowledge discovery
algorithms for complex collaboration environments.
Empirical material is collected in the context of
authority cooperation.

The working environment of organizations
has changed due to the extensive use of information
technology. Organizations are more or less
interrelated with each other, and many activities
are executed using technical tools and networks.
Relationships change more or less frequently,
making the working environment challenging. New
relationships are constructed while others are
in the execution phase, which contains planning and
decision-making. Those phases differ from each
other and thus require different types of information
to be exchanged. Organizations are interdependent
through certain cross-organizational and
non-organization specific processes. They have
common interests concerning certain objectives in
certain situations. Information technology glues
organizations together in two ways. It enables
collaboration and the use of non-organization
specific services, and it enables somewhat free
information publishing and gathering. The organization
independent information domain makes
inter-organizational relationships complex and
emergent by nature. This emergence cannot be
controlled, but the content of mutually available
information can be structured to some degree by
using processual and technological tools. Knowledge
discovery is about combining information
to find hidden knowledge. This chapter describes
what type of knowledge should be discovered when
acting in an evolving cooperation environment.
Knowledge discovery can be seen as a tool that
enables a more sophisticated way for organizations
to optimize their efforts to reach their goals at an
adequate networking level.

Cross-organizational collaboration situations 
in inter-authority context are analyzed to increase 
understanding about the activity environment, 
where knowledge discovery needs may occur. It 
will be shown that information needs vary
depending on the phase of activity of an actor.
The main research question is: “What type of
information shall be discovered to serve actor’s
needs during different phases of its activity?”
This question is addressed with examples based on
empirical findings from several collaboration situations
of inter-working authorities. The analysis
of these cases is based on a multi-theoretical model
of human information handling.

The information domain can be divided into two main
areas. The first is the content of the information.
Content is typically defined by the requirements of
doing something. Content is related to a subject
of particular interest. The other main area is the
information framework. This can be referred to as
the universal level of the information domain.
This universal level describes the information
phenomena of the situation under concern. It
defines general information exchange features of
getting together and dealing with challenges no
matter what they are. This universal framework can
be illustrated, as is done in this study, with a
human oriented information categorization model.
The model acts as a frame of reference to typify 
information requirements in different phases of 
networking activity. The model is an approach 
to the ontology of human information handling 
in a context of a complex adaptive social system 
(Holland 1996). Theoretical basis for modeling 
this human information exchange is based on 
philosophy of communication and cognition, 
theory of knowledge management, sociology, and 
decision-support systems. 

The approach to information is framework-oriented
and universal. It aims to increase understanding
of information exchange situations and offers a
user-focused way to develop dynamic
knowledge discovery solutions. The scientific
approach is hermeneutical, supported by validating
empirical results. The research approach is
cross-disciplinary. The ultimate goal of this paper
is to open novel viewpoints for extracting useful
knowledge from available data. The basic hypothesis
is: “Understanding the varying nature of information
requirements of a networking organization
at information framework level offers an enhanced
departure point to develop more user supportive
knowledge discovery solutions.”


THEORETICAL BACKGROUND 
AND METHODOLOGY 


Complex Adaptive Systems 


The problem area is approached using complex
adaptive systems (CAS) theory as a comprehensive
frame of reference (Holland 1995, Kauffman
1995, Ball 2004). The change in working environment
phenomena is demonstrated via CAS-based
hypotheses of traditional and networked acting
circumstances. The viewpoint is the information that
is required for successful networking.

The theory of complex adaptive systems
(CAS) by Holland (1995) aims to explain the
chaotic nature of a multi-actor interactive system
from the viewpoint of one actor. CAS theory
seeks to understand the adaptive behavior of
an entity in its acting environment by categorizing
its basic features. CAS theory divides these basic
elements into four properties and three mechanisms.
The properties are aggregation, nonlinearity, flows
and diversity. The mechanisms are tagging, internal
models and building blocks.


• Aggregation is a property of an entity. It
means that an entity seeks to categorize
similar things into similar classes.
All new perceptions are then placed
into these classes to make the outer world
easier to understand. On the other hand,
aggregation aims to explain what a complex
adaptive system does as a whole. It seeks
to gain understanding of the behavioral
phenomena of entities defined by a certain
plethora of classes.

• Tagging is a mechanism that gives a
descriptive symbol to an aggregate. A tag is a
name or symbol that gathers corresponding
entities together.

• Flow is a property that tells what transfers
between nodes. Nodes act like processors
that refine or redirect flows. Nodes,
connectors and flows vary over time. Flow
has two properties. The first is called the
multiplier effect: an additional
resource injected into a system produces
a chain of changes by affecting the internal
behavior or redirecting properties of
nodes. The second is a feedback
process, where the output of a process
affects the input stage of the process. In a
CAS environment several such feedback
processes and interacting relationships
take place simultaneously, because
nodes are connected together via an evolving
connector network.

• Building blocks form the mechanism that
makes it possible to construct models in a simple
way. Each block is a tagged aggregation.
Second-level aggregations and models can
be formulated by combining certain simple
enough building blocks. Blocks are
combined in space in a certain order to
form models that can be tagged as
meaningful for the node.

• Nonlinearity is a property that expresses that
the outcome of the whole is not the sum
of its parts. The outcome of a multi-actor
interaction situation cannot be
deterministically calculated by knowing the features
of all entities.

• Diversity is a property that tells that a
whole contains a certain amount of certain kinds
of nodes that have a suitable role in that
whole. If a specific node is removed, it
will be replaced over time with another
similar kind of node, or its roles are
transferred to other nodes.

• Internal modeling (or schema) is a
mechanism that causes certain behavior of an
entity when a certain stimulus occurs.
(Holland 1995, 10-40)
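
The multiplier effect described in the list above can be illustrated with a very small numerical sketch. It assumes a simplified chain in which every node passes on a fixed fraction of what it receives; the function names and the parameter `pass_on` are hypothetical and are not part of Holland's formulation.

```python
def multiplier_total(injection: float, pass_on: float, stages: int) -> float:
    """Total activity after `stages` hand-overs along a chain of nodes.

    Assumption: each node passes on the fixed fraction `pass_on` of the
    resource it receives to the next node.
    """
    if not 0.0 <= pass_on < 1.0:
        raise ValueError("pass_on must be in [0, 1)")
    total, current = 0.0, injection
    for _ in range(stages + 1):  # stage 0 is the injection itself
        total += current
        current *= pass_on
    return total


def multiplier_limit(injection: float, pass_on: float) -> float:
    """Limit of the geometric series: injection / (1 - pass_on)."""
    if not 0.0 <= pass_on < 1.0:
        raise ValueError("pass_on must be in [0, 1)")
    return injection / (1.0 - pass_on)
```

With an injection of 100 units and half of each amount passed on, the total effect approaches 200 units, twice the original injection, which is the sense in which the injected resource is "multiplied".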


The world can be considered as a complex
system of complex systems. It is neither random
nor accidental. It is a collection of systems’
elements with certain kinds of universal features and
the continuum of their interrelations. This makes
the world act in a non-deterministic way. This
apparently fuzzy behavior becomes understandable
if we perceive the system at the right structural
level (see Ball 2004; Kauffman 1995).

This study focuses on the property called
“aggregation”, supported by “tagging”, “flow” and
“building blocks”. It will be shown that shifting
from content-based knowledge discovery thinking
towards situation-based thinking gives a novel
opportunity to construct knowledge discovery
practices that support networked people
in gathering and releasing relevant information from the
viewpoint of the situation they are dealing with.

So, let’s generate hypotheses concerning
aggregation in both the content-based approach and
the framework-based approach. First, content-based
aggregation supplemented with network features
(motivation of networking, information flow
characteristics and building blocks):


• People like to categorize the exchanged
information. Typically information is
defined by subject of interest and is thus
categorized by content. The behavior of the
whole is judged on the basis of the
aggregation of those content-based information
categorization models.

• Social communication networks are
defined by subject of interest. The name (tag)
of the interest guides people to form
networks with people who express the same
kinds of tags.

• Information flow between various
interactive entities is controlled by content and
amount. Second-order effects are typically
not taken into account.

• Because of the content-based strategy for
orienting towards interaction, the building
blocks for creating common models for
releasing and receiving relevant information
may differ among the communicating
actors during networking situations.
This makes communication challenging,
as different actors are speaking
in different contexts.


And secondly, the framework-based hypotheses:


• Aggregation shall be done on the basis of
the networking context and the situation’s
requirements instead of the communicated
information content. Second-order aggregation
then describes the nature of
networking instead of the meaning of each
collaborative party.

• Tagging is formed, in addition to defined
content, also around the phase of networking
activity. Tagging supports context- and
situation-based aggregation.

• Information flows are controlled by the
demands of the networking context and situation
instead of one or several parties’
agreements on releasable information. Each
networking party releases the information
that is relevant for the tagged networking
aggregation.

• Building blocks are situations instead of
organizations or other actors. The outcome
of the comprehensive context will be
constructed as a system of situations rather
than a system of actors.
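
The difference between content-based and situation-based aggregation can be sketched as a small data structure. This is only an illustration under stated assumptions: the class `InfoItem`, its field names, the phase labels and the example items are all hypothetical and do not come from the study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InfoItem:
    text: str
    content_tag: str  # subject of interest (content-based tag)
    phase_tag: str    # phase of networking activity (situation-based tag)
    source: str       # actor that released the item


def aggregate(items, tag_field):
    """Group items by one of their tag fields."""
    groups = defaultdict(list)
    for item in items:
        groups[getattr(item, tag_field)].append(item)
    return dict(groups)


# Hypothetical example items released by three different actors.
items = [
    InfoItem("Helicopter availability", "air transport", "planning", "Agency A"),
    InfoItem("Road closures", "logistics", "planning", "Agency B"),
    InfoItem("Casualty count", "medical", "execution", "Agency C"),
]
by_content = aggregate(items, "content_tag")  # three separate content groups
by_phase = aggregate(items, "phase_tag")      # items from A and B meet in "planning"
```

Grouping by content keeps the three actors apart, while grouping by phase brings the items of two actors into the same situation, which is the building block the framework-based hypotheses propose.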


High-Level Information 
Exchange Ontology 


Actors’ interests in information can be categorized
in several ways, e.g. on a time axis, based
on information content, based on the role of a
particular actor, or based on the phase of activity.
Information interests differ from one situation to
another and also from one actor to another. All
these interest viewpoints exist during the situation
where actors are involved. A unified and sufficiently
abstract structure for describing information is
needed to get an idea of what type of information
various networking situations may require
and, further, to structure various knowledge
discovery situations in a uniform way.

Pardo et al. (2004) take a holistic view of this
area and approach the challenges via components of
social processes, resources, and technical and social
artifacts in the contexts of technology, business
processes, inter-organizational structures and
policies in the social environment. They construct
a four-layered model defined by context. Every
layer contains similar integration components,
producing a system with integrating processes
in focus. However, this model does not take
into account the general-level ontology of the
exchanged information. Wang & Wei (2007) take
a value chain and process approach to the issue.
They, too, concentrate on the ways to collaborate
and do not provide an approach to the
exchanged information itself.

However, the approach here focuses on the
information itself, although the process viewpoint is
important. For that reason, a high-level abstraction
of human information exchange ontology is
briefly introduced here. This ontological model
has been developed, iterated and applied frequently
during the last few years (2004-2008). The model is
based on communication philosophy (Habermas
1984, 1989), sociology (Parsons 1951; Luhmann
1999), cognition philosophy (Bergson 1983;
Damasio 1999; Merleau-Ponty 1968), organizational
culture (Schein 1992; Hofstede 1984),
knowledge management (Polanyi 1966; Maier
2002; Nonaka & Takeuchi 1995) and decision
support systems (Turban et al. 2005; Marakas
2003). Various development variants of the model
have been published frequently (e.g. Kuusisto
2006, 2007 and 2008; Kuusisto et al. 2007;
Kuusisto & Kuusisto 2008). The up-to-date
version is described in Table 1.

Rows describe the temporality and abstraction
degree of information. Information in the upper
row is relatively the most abstract and future-oriented,
and its effects are long-lasting. The lowest level
contains information that updates fast, is concrete
and is observable as immediate events. The column
on the left contains cultural information described
by Schein (1980 & 1992). The second column from
the left contains actors’ internal information. The
next column to the right contains information on
expressed conclusions made by the actor. The column on
the right describes information that comes from
outside of an actor or is remarkably affected by
the world outside the actor itself. Rough contents
of the information categories are described, as
well. This model describes the structure of
individual and shared information and sense-making.
With the help of this ontology the complex
information exchange activity can be simplified and
emerging phenomena of the inter-working network
can be found.

Table 1. The high-level abstraction of human information handling ontology

Columns, left to right: Values, Competence | Internal facts | Conclusions | External facts

Row 1 (most abstract)
Basic assumptions: Hidden assumptions that will guide the behaviour of an actor. The fundamental features of a culture.
Mission, vision: A subjective and expressed impression of the end-state of the actor.
Decision: A solution based on thinking and assessment.
Task: Activities or work to be performed, activities originated by upper-level management or by the development of a situation.

Row 2
Socially true values: Assumptions that are mutually accepted in a certain group to be a basis of thinking and executing activities.
Means: Activities or methods applied to reach an aim or fulfil a purpose.
Alternatives to act: Description of realistically executable acting solutions.
Foreseen end states: Future situations most certainly reached when activities are finished.

Row 3
Physically true values: Assumption about structures that can be accepted to be valid, e.g. organization, division of labour, competences.
Resources: Available tangible resources such as people, financial resources, material, machinery and office space.
Possibilities to act: Describes possible paths to the goal that the actor can choose and that provide something new to the actor. For example, strategy alternatives.
Anticipated futures: Describes a thing, event or development that can be taught or is expected.

Row 4
Social artefacts: Structure of a social system, principles of interaction, description of nodes and their mutual positions, and observable behaviour.
Action patterns: Describes how an actor can behave, e.g., process descriptions and instructions.
Restrictions: Things that have to be considered before planning the use of resources and means in the context of anticipated futures.
Environment: Describes an area or a space that affects an actor, e.g., activities of media, market trends, national trends, and global trends.

Row 5 (most concrete)
Physical artefacts: Results of activity, like technical results of a group, written and spoken language, symbols, art.
Features: Describes properties of objects such as the properties of an organization or equipment, e.g., infrastructure descriptions and properties of equipment.
Event model: A description that enables the outlining of the pattern of a situation. For example, reports, documents, analyzed conclusions such as quality reports, statistics, pictures and maps.
Events: Describes time-limited events caused by actors. For example, meetings, sales reports and stock market prices.

Every layer of the model has a specialized
task in handling information for various
purposes, such as situation follow-up, planning
and decision-making. The layer that deals with
event information produces an updated picture of
events. On the next layer, restrictions are sorted
out. The next two layers contain information about
resources and means as internal facts. These facts,
as well as information about events and environment,
and knowledge about the composition and
the development of the situation and possible
end-states, are used as a basis for making conclusions.
The possibilities to act and information about
alternate ways to operate are refined. The chain
of deduction can be continued until the ultimate
decision-making layer is reached. There, all output
information from the lower layers shall be available
in explicitly expressed form. Conclusions of
a neighboring layer are relatively more meaningful
than information on the other layers. The whole
spectrum of the tacit dimension shall be available for
the decision-maker. The decision-maker must be
able to know the action patterns, anticipate the
change of the situation, foresee the end-state of
the action and deeply understand the meaning
of the mission as a part of the bigger continuum
of action.

This ontology of human information handling
is used to analyze various
information sharing and information exploitation
situations. Because it is universal, it can be used
to analyze and develop knowledge discovery
practices, too.
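
For use in knowledge discovery tooling, the grid of Table 1 can be held as a small data structure. The following is a minimal sketch: the category and column names are those of Table 1, while the names `COLUMNS`, `ONTOLOGY`, `CATEGORIES` and `locate` are hypothetical helpers introduced only for illustration.

```python
# Columns of Table 1, left to right.
COLUMNS = ("Values, Competence", "Internal facts", "Conclusions", "External facts")

# Rows of Table 1, from the most abstract (index 0) to the most concrete (index 4).
ONTOLOGY = (
    ("Basic assumptions", "Mission, vision", "Decision", "Task"),
    ("Socially true values", "Means", "Alternatives to act", "Foreseen end states"),
    ("Physically true values", "Resources", "Possibilities to act", "Anticipated futures"),
    ("Social artefacts", "Action patterns", "Restrictions", "Environment"),
    ("Physical artefacts", "Features", "Event model", "Events"),
)

# Flat list of all twenty categories, row by row.
CATEGORIES = tuple(category for row in ONTOLOGY for category in row)


def locate(category):
    """Return (row index, column name) of a Table 1 category."""
    for row_index, row in enumerate(ONTOLOGY):
        if category in row:
            return row_index, COLUMNS[row.index(category)]
    raise KeyError(category)
```

A call such as `locate("Restrictions")` gives the abstraction level and the column of a coded expression, which is the information the later profile tables are built on.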


Methodology 


Content analysis is used as the methodology to find
abstractions of information requirements in different
working situations. Applying the criteria of
Krippendorff (1980) to define the context of content
analysis gives the following for this research case:


• Data on the content of various information
exchange situations were analyzed.
The analyzed data are described in the context
of every studied case.

• Data were defined using expert evaluation
criteria based on the ontology described in
chapter 2.2 and illustrated in Table 1.

• The population from which the data were drawn
is also described in the context of each studied case.

• The context relative to which the data are analyzed
is the information exchange requirements of
interacting actors in different situations.

• The target of generalization of the data is to
find patterns that may reveal
differences in information exchange
profiles between various situations. (About)
five of the most frequent categories
described in Table 1 are bolded and (about)
the five next most frequent categories are
underlined in the final analysis of each case to
demonstrate the differences in information
requirements between various situations.
Quantification of the analyzed and abstracted
data is done by calculating the relative shares
(%) of the analyzed data placed into each
category. That makes the final results
comparable to each other.

• The boundaries of the analysis are defined by
the limited total number of observed
information exchange situations
and by ambiguity when assigning data to the
nominated categories of the ontological
model used.
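
The quantification step described in the list above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function names are hypothetical, shares are rounded to whole percentages, and ties are broken alphabetically, whereas the study speaks of "(about) five" categories, presumably because of ties.

```python
from collections import Counter


def relative_shares(coded_items):
    """Relative share (%) of each category in a list of coded expressions."""
    counts = Counter(coded_items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total) for category, n in counts.items()}


def highlight(shares, top=5, next_n=5):
    """Return (most frequent categories, next most frequent categories).

    Categories with a zero share are ignored; ties are broken alphabetically
    (an assumption of this sketch).
    """
    ranked = sorted(
        (c for c, share in shares.items() if share > 0),
        key=lambda c: (-shares[c], c),
    )
    return ranked[:top], ranked[top:top + next_n]
```

For example, a list with three expressions coded as "Means" and one coded as "Resources" yields shares of 75 % and 25 %.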
Because of the boundaries of the analysis, the
method is not completely reliable. Only one
observer was used to observe and analyze all studied
cases. This observer manually categorized and
counted key expressions and ideas of the contents
expressed in texts or speech. Categorization was
done using the Table 1 category definitions as
reference. It is obvious that interpretation in this kind
of situation is to some degree subjective. Because
only one observer-analyst was used, the intra-rater
reliability (stability) can be considered to
be rather high, but the inter-rater reliability may
be moderate. However, the reliability problem is
not as severe as it sounds, because exact results
are not sought. The aim of the content analysis
in this research is to find possible differences
in information requirements between different use
situations in order to demonstrate possible requirements
for knowledge discovery techniques. The cases
were analyzed during the period from
August 2007 to December 2007. A
priori coding was used, applying the ontology
described in Table 1 as theory.
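
Inter-rater reliability was not measured in this study, because only one observer was used. If a second coder were available, agreement could be quantified, for example, with Cohen's kappa. The sketch below assumes that both coders have assigned exactly one Table 1 category to the same sequence of expressions; it is offered as an illustration of the concept, not as part of the reported method.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # Share of items on which the two coders chose the same category.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance from each coder's category frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one and the same category throughout
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and a value of 0 means agreement no better than chance.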


CASE STUDIES


Actor Profile Information


Twenty individual actors from military, governmental
and non-governmental organizations taking
part in a cooperation experiment in December
2007 were surveyed to find out what kind of
information they would prefer to have about their
potential networking partners. The actual question
was: “When joining the network community, what
types of information about your potential
collaboration partners would you like to have?” The
question did not concern general administrative
information like contact information or technical
information. Free-word answers were analyzed
as described in chapter 2.3. Results are shown
in Table 2. All results in tables are converted into
relative shares (%).

Table 2. Relative shares (%) of information needs for network partner choosing situations

Columns: Values, Competence | Internal facts | Conclusions | External facts
Basic assumptions 9 | Mission, vision 21 | Decision 0 | Task 6
Socially true values 6 | Means 6 | Alternatives to act 0 | Foreseen end states 0
Physically true values 3 | Resources 9 | Possibilities to act 0 | Anticipated futures 0
Social artefacts [illegible] | Action patterns 18 | Restrictions 0 | Environment 0
Physical artefacts 3 | Features 18 | Event model 0 | Events 0

In addition, web pages of five real organizations
were analyzed using the Table 1 information
categorization model and the content analysis
method. The categories that were found and their
relative frequencies are shown in Table 3.

Table 3. Relative shares (%) of information categories that real organizations release about themselves

Columns: Values, Competence | Internal facts | Conclusions | External facts
Basic assumptions 6 | Mission, vision 14 | Decision 0 | Task [illegible]
Socially true values 6 | Means 16 | Alternatives to act 0 | Foreseen end states 6
Physically true values 6 | Resources 6 | Possibilities to act 2 | Anticipated futures 0
Social artefacts 0 | Action patterns 14 | Restrictions 0 | Environment 4
Physical artefacts 4 | Features 10 | Event model 0 | Events 0

Those two tables are alike enough to be
combined and a mean value calculated. Summarized,
categorized and highlighted, the result is as
Table 4 describes.

It can be noticed that the most important
information in the network foundation phase concentrates
on every actor’s internal facts, together with values
and competence information. In addition, after
analysis of thirty crisis management professionals’
experiences and a somewhat comprehensive lessons
learned report (Linder 2006) on crisis management
practices in the field, one additional category
rose up as quite an important one (Kuusisto &
Kuusisto 2008). This category is “environment”.
Information about all working environment
features and issues was found crucial to working
successfully in the area. So, we need to know the
basic features of our working environment in
addition to the internal facts of organizations or
individuals in the network.
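
The combination of Table 2 and Table 3 into Table 4 is a category-wise mean, which can be sketched as follows. The function name `mean_profile` is hypothetical, and only three category values that are clearly legible in Tables 2 and 3 are used in the example; how the original analysis rounded the means is not stated, so the sketch returns unrounded values.

```python
def mean_profile(profile_a, profile_b):
    """Category-wise mean of two profiles given as {category: share in %}.

    A category missing from one profile is treated as a share of 0.
    """
    categories = set(profile_a) | set(profile_b)
    return {
        c: (profile_a.get(c, 0) + profile_b.get(c, 0)) / 2
        for c in categories
    }


# Values as printed in Table 2 (survey) and Table 3 (web pages).
survey = {"Features": 18, "Foreseen end states": 0, "Environment": 0}
web_pages = {"Features": 10, "Foreseen end states": 6, "Environment": 4}
combined = mean_profile(survey, web_pages)
```

For these three categories the result is 14, 3 and 2, which agrees with the corresponding cells of Table 4.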


Table 4. Relative shares (%) of information requirement profile for network partner choosing situations

Columns: Values, Competence | Internal facts | Conclusions | External facts
Basic assumptions [illegible] | Mission, vision [illegible] | Decision 0 | Task 6
Socially true values [illegible] | Means [illegible] | Alternatives to act 0 | Foreseen end states 3
Physically true values [illegible] | Resources [illegible] | Possibilities to act 1 | Anticipated futures 0
Social artefacts [illegible] | Action patterns 16 | Restrictions 0 | Environment 2
Physical artefacts 4 | Features 14 | Event model 0 | Events 0


Planning the Mission


Next, some findings based on the analysis of an
inter-agency cooperation exercise are presented.
The exercise was arranged in Finland in August 2007.
Participants were professionals from several
ministry- and agency-level organizations. The nature
[...]ment (i.e. optimizing the available resources
to deal with the ongoing situation and to respond to the
foreseeable near-term future). The observed activity
phases concentrated on mission planning prior to
mission execution. Three different information
exchange situations were observed. First, a total
of four general briefings were followed. Second,
one decision discussion was observed. Finally,
the information content that was released by all
participants into the collaboration support system
database was analyzed at the title level.

In briefing situations, almost all information
was shared unidirectionally by the
briefer. About 10% of the information items were
discussed. Discussion dealt mainly with means,
resources, alternatives to act, possibilities to act
and restrictions (see Table 5 and Table 6).

Table 5. Relative shares (%) of unidirectional released information during briefings

Columns: Values, Competence | Internal facts | Conclusions | External facts
[Most cell values are illegible in the source. Legible fragments: 20 in the row Socially true values / Means / Alternatives to act; 18 in the row Physically true values / Resources / Possibilities to act; 15 in the row Social artefacts / Action patterns / Restrictions.]

The information exchange profiles of the decision
and planning discussion, concerning released and
discussed information, are described in Table 7
and Table 8. Because this was a matter of
experts’ contribution to support decision making,
about 50% of the released information was also
discussed. Almost every other item was discussed,
except the situation facts and final decisions.
The information released and discussed was situated
mainly in the middle of the information exchange
model. There is one deviation compared to the discussions
during briefings: more information about
future development was both released and especially
discussed during decision-making situations
than during briefings.

Table 9 shows the distribution of the information
types of the generally published information.
The classification criterion was the content of the
title of the released publication. As a whole, the same
kinds of information types were released in the general
publication process as in the unidirectional
phases during briefings and decision discussions.

Mean values of both released and discussed
information were calculated to get an idea of what
kind of information exchange profiles exist while
releasing and discussing mission planning information,
and whether differences can be found between these two
information use situations. Results are shown
in Table 10 and Table 11. Differences were
indeed observed. The information profiles for
releasing information in briefings and for discussing
further actions seem to be different.

Table 6. Relative shares (%) of discussed information during briefings

Columns: Values, Competence | Internal facts | Conclusions | External facts
Basic assumptions 0 | Mission, vision 0 | Decision 0 | Task 0
Socially true values 0 | Means [illegible] | Alternatives to act 17 | Foreseen end states 0
Physically true values 0 | Resources [illegible] | Possibilities to act 17 | Anticipated futures 0
Social artefacts 0 | Action patterns 0 | Restrictions 29 | Environment 0
Physical artefacts 0 | Features 0 | Event model 0 | Events 0

Table 7. Relative shares (%) of released information during decision discussion

Columns: Values, Competence | Internal facts | Conclusions | External facts
Basic assumptions [illegible] | Mission, vision [illegible] | Decision [illegible] | Task [illegible]
Socially true values 0 | Means 21 | Alternatives to act 2 | Foreseen end states 2
Physically true values 0 | Resources 10 | Possibilities to act 5 | Anticipated futures 2
Social artefacts 0 | Action patterns 2 | Restrictions 5 | Environment 2
Physical artefacts 0 | Features 0 | Event model 2 | Events 21

Some brief conclusions can be drawn from
the results presented above. In a tactical planning
situation, information in the middle of the
model becomes important in addition to situation
follow-up and the release of decision information.
During briefings, where many representatives of
various organizations were present, discussions
arose mainly about available means and resources
and about possibilities and alternatives
to act, as well as mutual restrictions on activities.
In the case of small-group decision-making
discussion, the general information release profile
was quite similar to the one in briefings. As for
the discussed information categories,
the means and resources items were still found to
be important, but discussion about alternatives to
act moved towards anticipating the future and
evaluating the possible end-states of the overall activity.
Discussing a mutual future orients parties
to work together over longer periods rather than only
dealing with emerging issues.
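
The comparison between a "released" profile and a "discussed" profile can be sketched as a category-wise difference. The function name `profile_shift` is hypothetical and the numbers in the example are illustrative only; they are not taken from the tables of this chapter.

```python
def profile_shift(released, discussed):
    """Difference (discussed minus released) in percentage points per category.

    Both profiles are {category: share in %}; a missing category counts as 0.
    """
    categories = sorted(set(released) | set(discussed))
    return {c: discussed.get(c, 0) - released.get(c, 0) for c in categories}


# Illustrative values only.
released = {"Events": 17, "Restrictions": 10}
discussed = {"Restrictions": 18}
shift = profile_shift(released, discussed)
```

A positive value marks a category that is discussed more than it is released, a negative value a category that is mostly released without discussion, such as event information in a briefing.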


Table 8. Relative shares (%) of discussed information during decision discussion

Columns: Values, Competence | Internal facts | Conclusions | External facts
Basic assumptions 0 | Mission, vision 0 | Decision 0 | Task [illegible]
Socially true values 0 | Means 38 | Alternatives to act 6 | Foreseen end states 6
Physically true values 0 | Resources 19 | Possibilities to act 13 | Anticipated futures [illegible]
Social artefacts 0 | Action patterns 6 | Restrictions [illegible] | Environment 6
Physical artefacts 0 | Features 0 | Event model 0 | Events 0

Table 9. Relative shares (%) of collaboration support system database published information

Columns: Values, Competence | Internal facts | Conclusions | External facts
Basic assumptions 0 | Mission, vision 0 | Decision 4 | Task [illegible]
Socially true values 0 | Means 24 | Alternatives to act [illegible] | Foreseen end states 1
Physically true values [illegible] | Resources [illegible] | Possibilities to act [illegible] | Anticipated futures 6
Social artefacts [illegible] | Action patterns [illegible] | Restrictions [illegible] | Environment [illegible]
Physical artefacts 0 | Features 4 | Event model 6 | Events [illegible]

Table 10. Relative shares (%) of unidirectional released information during mission planning

Columns: Values, Competence | Internal facts | Conclusions | External facts
Basic assumptions 0 | Mission, vision 0 | Decision 10 | Task [illegible]
Socially true values [illegible] | Means [illegible] | Alternatives to act [illegible] | Foreseen end states [illegible]
Physically true values [illegible] | Resources [illegible] | Possibilities to act [illegible] | Anticipated futures [illegible]
Social artefacts 0 | Action patterns 4 | Restrictions 10 | Environment 2
Physical artefacts 0 | Features 2 | Event model 4 | Events 17

Table 11. Relative shares (%) of discussed information during mission planning

Columns: Values, Competence | Internal facts | Conclusions | External facts
Basic assumptions 0 | Mission, vision 0 | Decision 0 | Task 0
Socially true values [illegible] | Means [illegible] | Alternatives to act 12 | Foreseen end states 3
Physically true values 0 | Resources 18 | Possibilities to act 15 | Anticipated futures 3
Social artefacts [illegible] | Action patterns [illegible] | Restrictions 18 | Environment 2
Physical artefacts 0 | Features 0 | Event model 0 | Events 0


Executing Task


The tactical management execution activities of the
Barents Rescue 2007 (BR) search and rescue
exercise were analyzed
(http://www.pelastusopisto.fi/pelastus/cmc/home.nsf/wLatestEng/0BF38ADDB817B1E1C225726E003022B4,
visited December 08, 2008). The nature of Barents Rescue
was operational. Various authorities and volunteer
organizations from four countries executed an
airliner accident search and rescue operation in
Finnish Lapland.

A collaboration support system was established
to support the actors in releasing and using relevant
planning and mission execution information.
Information that was published on the collaboration
support system was analyzed to get an idea of
the pattern of the information exchange profile in
the mission execution phase.

In the first phase information publishing was
concentrated more on the features-events layer, and
when the rescue execution started the publishing
contained decisions. It can be noticed that value
and mission information was not released at all.
That is because the mission was clear to all.
Another interesting feature is that information about
future development and conclusions (except
decisions) was not released either (Table 12).
This indicates that the various organizations did not have
to do collaborative planning. This can have
two explanations: either the division of tasks
was totally clear to everyone, or the organizations
did not see added value in on-line collaboration
during the mission execution phase.

Table 12. Relative shares (%) of published information categories on collaboration support system in BR mission execution

Columns: Values, Competence | Internal facts | Conclusions | External facts
[Most cell values are illegible in the source. Legible fragments: 10 in the row Basic assumptions / Mission, vision / Decision; 20 in the row Physically true values / Resources / Possibilities to act; 14 in the row Social artefacts / Action patterns / Restrictions; Event model 3.]


Conclusive Findings 


As a conclusion it can be argued that improving
action in networks needs a concept that
provides as good a system as possible for
focusing knowledge discovery systems on relevant
types of information.
It can be postulated that different kinds of
situations require different emphases concerning
the type of required information. So, it
can be concluded that the hypothesis set for this
study can be considered valid. Further on,
we can postulate that different phases of activity
require differently focused knowledge discovery.
Knowledge discovery practices shall be such that
they guide the user to find and also release the
type of information relevant to the nominated
activity phase. This is consistent with Habermas's
theory of communicative action (1984, 1989). He
claims that to start communication, at least one
common item must exist between interacting
parties. Interaction and its development are based
on this common item. The implication is that to
conduct interaction between two or more actors,
one or multiple common categories of information
must be present. To gain mutual understanding,
interacting parties require common information
flows. Knowledge discovery practices shall be
developed to enable these information flows and to
support users to both release and find relevant
types of information related to an ongoing or
emerging situation.

Information usage profiles differ between 
evaluating a network partner's potential for 
cooperation, preliminary planning work, the 
decision-making itself, and managing various 
networking situations. To reiterate, at least one 
information category must be common to those 
functions. In general, this means that 
organizations should understand which types of 
information are important for the activities 
between organizations. This understanding can be 
used to produce and develop knowledge discovery 
practices. Those practices should support 
information exchange procedures across 
organizational boundaries, to assure the 
information flow priorities and to take into 
account the temporal demands of information 
exchange. 

Information exchange profiles for networking 
shall be determined to optimize interactivity. 
Organizations, or parts of organizations, that 
work with the same kinds of issues should have 
common information exchange profiles. Networking 
can be enhanced when information content 
priorities and the time frames for updating 
content are consistent across the various 
inter-organizational actors. It can be concluded 
that, to develop inter-organizational working 
processes, it is essential to identify, develop 
and exploit inter-working information exchange 
profiles. Finding and determining those profiles 
will give a good basis for developing suitable 
knowledge discovery algorithms. 
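
The idea of a common information exchange profile can be sketched in code. The following is only a minimal illustration, assuming a hypothetical profile structure that records a priority and an update interval for each information category; the names `CategorySetting` and `ExchangeProfile`, the priority scale and all values are assumptions made for the example, not results of the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategorySetting:
    priority: int        # 1 = highest priority (assumed scale)
    update_hours: float  # assumed interval between content updates


# Hypothetical structure: information category -> its exchange setting.
ExchangeProfile = dict[str, CategorySetting]


def common_categories(a: ExchangeProfile, b: ExchangeProfile) -> set[str]:
    """Categories both actors exchange; interaction needs at least one."""
    return set(a) & set(b)


def inconsistent_categories(a: ExchangeProfile, b: ExchangeProfile) -> set[str]:
    """Shared categories whose priority or update interval differs."""
    return {c for c in common_categories(a, b) if a[c] != b[c]}


# Illustrative profiles for two cooperating organizations.
org_a = {
    "resources": CategorySetting(priority=1, update_hours=24),
    "action patterns": CategorySetting(priority=2, update_hours=168),
}
org_b = {
    "resources": CategorySetting(priority=1, update_hours=24),
    "action patterns": CategorySetting(priority=3, update_hours=168),
    "restrictions": CategorySetting(priority=2, update_hours=24),
}

print(sorted(common_categories(org_a, org_b)))
print(sorted(inconsistent_categories(org_a, org_b)))
```

A non-empty set of common categories corresponds to the requirement that at least one information category is shared, and an empty set of inconsistent categories corresponds to consistent content priorities and update time frames.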


CONSEQUENCES FOR 
KNOWLEDGE DISCOVERY 


Technological solutions have the potential to 
increase an organization's capability to perform 
its task. To increase capability even further, the 
practices, processes and ways of acting shall be 
renewed or adjusted to meet the demands of the 
holistic working environment. Furthermore, 
organizational culture shall be adjusted if 
success is to be sustained over a longer period. 
If knowledge discovery solutions support the 
required processes and practices with 
working-phase-dependent information discovery, the 
overall capability of the organization will 
increase. If the requirements on information types 
are well known, it will be easier to develop more 
situation-specific knowledge discovery solutions 
for an organization's needs in a networked 
environment. This differs from content-based 
knowledge discovery, which aims at finding certain 
defined information rather than the information 
required to deal with a certain situation. 

Table 13. The importance of information categories compared to activity phases 

Activity phases (columns): Partnerizing; Planning published; Planning discussed; Execution 

Information categories (rows): Basic assumptions; Socially true values; Physically true values; Social artefacts; Physical artefacts; Mission, vision; Resources; Action patterns; Features; Decision; Alternatives to act; Possibilities to act; Restrictions; Event model 

[Table cell values are not recoverable from the source.] 

Table 13 combines four different cases of 
information exchange needs of an organization 
acting in the different networking situations 
described in chapter 3. Table 13 demonstrates 
rather clearly that information needs vary from 
one working phase to another. It can be assumed 
that this happens regardless of the purpose of the 
organization. However, more research is needed to 
obtain comparative material from, e.g., business 
organizations. Table 13 also shows that some 
information categories are emphasized in several 
phases of activity. Those categories are the ones 
that carry the networking activity from one phase 
to the next. When moving from the partner-finding 
and network-establishing phase to the start of 
common activity with planning, information about 
organizations' means, resources and action 
patterns acts as a bridge to the next phase. When 
moving from planning to execution, information 
about resources and events is crucial to enable a 
smooth transition. This means that knowledge 
discovery practices shall support not only 
in-phase mining but also in-between mining. This 
sets flexibility requirements for knowledge 
discovery systems. 
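
The distinction between in-phase and in-between mining can be illustrated with a small sketch. It assumes that each activity phase is described by a set of emphasized information categories; the sets below are hypothetical and chosen only to agree with the bridging categories named in the text, they are not read from Table 13.

```python
# Emphasized categories per phase. Illustrative assumption only: the sets
# match the bridging categories named in the text, not the cells of Table 13.
PHASE_EMPHASIS = {
    "partnerizing": {"basic assumptions", "mission, vision", "means",
                     "resources", "action patterns"},
    "planning": {"means", "resources", "action patterns", "decision",
                 "event model"},
    "execution": {"resources", "event model", "restrictions"},
}


def in_phase(phase: str) -> set[str]:
    """Categories to mine while the activity stays inside one phase."""
    return PHASE_EMPHASIS[phase]


def in_between(current: str, following: str) -> set[str]:
    """Categories emphasized in both phases; they carry the activity over."""
    return PHASE_EMPHASIS[current] & PHASE_EMPHASIS[following]


print(sorted(in_between("partnerizing", "planning")))
print(sorted(in_between("planning", "execution")))
```

Under these assumed sets, the transition from partnerizing to planning is bridged by means, resources and action patterns, and the transition from planning to execution by resources and the event model. A discovery system following this idea would widen its search scope to the bridging categories whenever a phase change is approaching.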

In addition to the type requirements of the 
information, the time domain is important as well. 
When searching for partners, the required 
information shall be available for long periods of 
time. Publishing this information is not very time 
critical. The only criteria are that it is 
up to date, accurate and available to possible 
partners. This information also typically changes 
rather slowly. 




Information released at briefings prior to 
activity planning and discussions shall be 
available early enough to guarantee sufficient 
time for all networking partners to complete their 
planning before the execution of the common 
activity. However, that information shall not be 
available for too long a period, because its 
reliability will inevitably expire over time. When 
planning, execution principles and possibilities 
of common activity are being actively discussed, 
the information shall be available on-line 
throughout the discussion. The same kinds of 
information availability requirements are valid 
during the activity execution phase. This time 
criticality sets requirements for knowledge 
discovery systems as well. Knowledge discovery 
systems shall be able both to support releasing 
the relevant types of information and to discover 
the right kind of information profile of partners, 
in order to answer quickly and reliably to the 
demands of the ongoing working phase. It is rather 
clearly noticeable that different situations set 
quite divergent requirements for knowledge 
discovery practices, both in the quality and type 
of information and in the time domain. 
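
The time-domain requirements described above can be expressed as per-phase availability policies. The sketch below is a hedged illustration: the `AvailabilityPolicy` structure, the phase keys and every duration are hypothetical assumptions chosen for the example, since the text gives no concrete time limits.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class AvailabilityPolicy:
    min_lead: timedelta           # release at least this long before execution
    max_age: Optional[timedelta]  # None = no expiry while kept up to date
    online: bool                  # must be reachable on-line throughout


# Assumed durations, for illustration only.
POLICIES = {
    "partnerizing": AvailabilityPolicy(timedelta(0), None, False),
    "planning published": AvailabilityPolicy(timedelta(days=7), timedelta(days=30), False),
    "planning discussed": AvailabilityPolicy(timedelta(0), timedelta(hours=1), True),
    "execution": AvailabilityPolicy(timedelta(0), timedelta(hours=1), True),
}


def is_usable(phase: str, released_at: datetime, now: datetime,
              execution_at: datetime) -> bool:
    """Check that information was released early enough and has not expired."""
    policy = POLICIES[phase]
    early_enough = execution_at - released_at >= policy.min_lead
    fresh = policy.max_age is None or now - released_at <= policy.max_age
    return early_enough and fresh


execution = datetime(2024, 2, 1)
print(is_usable("planning published", datetime(2024, 1, 10),
                datetime(2024, 1, 20), execution))
print(POLICIES["execution"].online)
```

Partner-search information has no expiry in this sketch, briefing information must be both early and not too old, and the two interactive phases are flagged as requiring on-line availability.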

Table 13 illustrates rather clearly that some key 
types of information seem to be important 
regardless of the situation the networking parties 
are in. Those categories are means and resources, 
as well as environment. Other frequently occurring 
categories are action patterns, decisions, 
restrictions, events, alternatives and 
possibilities to act, as well as futures 
information. However, the emphases of those 
information categories vary from one phase of use 
to another. This has consequences especially for 
how the discovered information is presented to the 
user. The user interface shall be such that it 
helps the user to find, more quickly and 
accurately, the information that is relevant for 
the networking phase in question. This will help 
the user to take the right kind of information 
into account and to make more reliable and 
accurate choices, thus limiting the need for both 
time and bandwidth. 

Knowledge discovery algorithms shall support 
finding information that is not only specific to 
the user's content. Terminology and symbolic 
expressions that are specific to different 
networking situations shall also be supported by 
knowledge discovery systems. An ontology that is 
specific to the networking situation, and that 
focuses on the phenomena of the networking 
situation itself, may be challenging to construct. 
One example of that was introduced in Table 1. The 
ontology shall support the knowledge discovery 
algorithms in finding, from the information 
published by a wide variety of networking parties, 
such contents as help to find relevant partners 
and to collaborate in an ever-evolving, 
technologically supported network. This means that 
there shall be both an ontology that focuses on 
networking in different phases of activity and an 
ontology that focuses on subject content. In this 
kind of case, knowledge discovery algorithms use 
both of these to dig out both the activity-phase 
relevant and the content-relevant information. 
This kind of knowledge discovery system supports 
the user by offering a platform that both guides 
the user to release situation-relevant information 
and offers content information that is relevant 
for the situation. Its power is in finding the 
knowledge that defines the comprehensive working 
situation. 
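
The combined use of a phase ontology and a content ontology can be sketched as a simple filter. This is a minimal illustration under strong assumptions: both ontologies are reduced to flat term sets, matching is plain word overlap, and all terms, the topic "logistics" and the example documents are hypothetical.

```python
# Hypothetical phase ontology: terms typical for each activity phase.
PHASE_ONTOLOGY = {
    "partnerizing": {"mission", "vision", "values", "resources"},
    "execution": {"event", "resources", "restrictions"},
}

# Hypothetical content ontology: terms typical for a subject area.
CONTENT_ONTOLOGY = {
    "logistics": {"transport", "supply", "warehouse"},
}


def tokens(text: str) -> set[str]:
    """Lower-cased whitespace tokens; a real system would normalize further."""
    return set(text.lower().split())


def relevance(text: str, phase: str, topic: str) -> tuple[int, int]:
    """Number of matching phase terms and matching content terms."""
    words = tokens(text)
    return (len(words & PHASE_ONTOLOGY[phase]),
            len(words & CONTENT_ONTOLOGY[topic]))


def discover(documents: list[str], phase: str, topic: str) -> list[str]:
    """Documents relevant to both the phase and the topic, best first."""
    scored = []
    for doc in documents:
        phase_hits, content_hits = relevance(doc, phase, topic)
        if phase_hits > 0 and content_hits > 0:
            scored.append((phase_hits + content_hits, doc))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [doc for _, doc in scored]


docs = [
    "Our mission and vision for supply chains",
    "Transport schedule for next week",
    "Company values and resources overview",
]
print(discover(docs, "partnerizing", "logistics"))
```

Only documents that match both ontologies are returned, which mirrors the requirement that discovered information is relevant to the activity phase and to the content at the same time.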


KEY TERMS AND DEFINITIONS 


Information: Knowledge derived from study, 
experience, or instruction; knowledge of specific 
events or situations that has been gathered or 
received by communication; intelligence or news; 
a collection of facts or data; the act of informing 
or the condition of being informed. 

Knowledge: Knowledge is part of the hierarchy 
made up of data, information and knowledge. 
Data are raw facts. Information is data with 
context and perspective. Knowledge is information 
with guidance for action based upon insight and 
experience. 

Knowledge Discovery: Knowledge Discovery 
and Data Mining (KDD) is an interdisciplinary 
area focusing upon methodologies for extracting 
useful knowledge from data. 

Network: An interconnected system of things 
or people. 

Networking Activity: The act of meeting new 
people in a business or social context by using 
computer networks as enablers. 

Networked Environment: The socio-technical 
system in which networking activity takes place. 

Complex Adaptive System: A complex, 
nonlinear, interactive system which has the 
ability to adapt to a changing environment. Such 
systems are characterized by the potential for 
self-organization, existing in a non-equilibrium 
environment. CASs evolve by random mutation, 
self-organization, the transformation of their 
internal models of the environment, and natural 
selection. 